ABSTRACT

A  principal  feature  of  computer  networks  is  the  ability  of  the  various  sites  of  the  network  to 
access  and  update  shared  information.  At  the  application  level,  the  global  information  takes  the 
form of shared file systems, databases, etc., and at lower levels, it takes the form of status information
used in controlling the network. This report focuses on techniques for maintaining (i) the
availability  of  global  information  in  the  face  of  various  kinds  of  failures  and  (ii)  the  consistency  of 
global  information  in  the  presence  of  concurrency. 

Two  failure  models  are  considered:  the  crash  model,  in  which  failures  are  instantly  detected, 
and  the  malfunction  model,  in  which  an  indefinite  period  of  time  may  lapse  before  failures  are 
detected. In the latter model, failed sites in the network may execute arbitrary state transformations
and emit arbitrary messages; they may exhibit malicious intelligence trying to disrupt the
functioning  of  the  rest  of  the  network. 

A  network  status  maintenance  scheme  based  on  a  global  clock  facility  is  designed  for  the 
crash  model.  The  global  clock  is  a  system  of  local  clocks,  one  at  each  site.  Using  the  primitives 
provided  by  this  scheme,  desired  consistency  and  availability  attributes  can  be  provided  for 
higher-level  software.  A  feature  of  this  scheme  is  the  guarantee  that  if  site  A  has  site  B  marked 
DOWN  at  a  certain  time,  then  site  B  is  really  DOWN  at  that  time.  An  algorithm  for  updating  a 
replicated  file  is  constructed  using  the  scheme. 

For the malfunction model, an approach to maintaining the correctness of global information
and preventing error propagation is developed. A combination of an assertion-checking step
and  a  unanimity-reaching  step  is  used,  along  with  replication  of  the  global  information  at  multiple 
sites,  for  this  purpose.  The  requirements  of  the  unanimity-reaching  step  are  formalized  as  a  more 
general  form  of  the  Byzantine  Generals  Agreement  and  algorithms  for  reaching  this  generalized 
agreement  are  presented. 

Centralized  and  distributed  deadlock  detection  algorithms  are  developed  for  distributed 
databases. These algorithms use clock facilities to ensure the consistency of snapshots taken of
the status of resources and transactions. It is shown that all genuine deadlocks are detected and
no spurious indications of deadlock are given by these algorithms.

INTRODUCTION


1.1. Definition of Global Information

A  principal  feature  of  computer  networks,  both  an  advantage  and  a 
necessity,  is  the  ability  of  the  various  sites  in  the  system  to  access  and 
update shared information. We refer to such shared information as global.
At  the  application  level,  the  global  information  takes  the  form  of  shared  file 
systems,  databases,  etc.  At  lower  levels,  it  takes  the  form  of  control  or 
status  information  for  the  purpose  of  such  system  functions  as  resource 
management,  synchronization,  routing,  reconfiguration,  protection  and  error 
control.  In  this  thesis,  we  will  focus  on  certain  requirements  arising  in  the 
management  of  global  information  and  developing  techniques  for  satisfying 
them.  Our  attention  will  be  restricted  to  distributed  computer  systems  that 
use message-passing rather than shared-memory as the basic communication
mechanism.

Just as with information stored in non-distributed systems, it is necessary
to manage global information in such a way that certain desirable attributes
are ensured. The major attributes are:

(a)  rapid  accessibility:  it  should  be  possible  to  access  and  update  the  global 
information  with  low  latency  and  high  bandwidth. 

(b)  security:  no  unauthorized  entity  should  be  able  to  access  or  update  the 
information. 

(c) integrity: relevant invariants, e.g., functional dependencies between
different items of information, restriction to a set of permissible values, etc.,
should be preserved.

(d) availability: in spite of failures of parts of the network, it should still be
possible  to  access  and  update  the  information. 

(e)  consistency:  in  spite  of  concurrent  usage  of  the  information,  every  entity 
that  accesses  the  information  should  be  able  to  form  a  consistent  view  of  it. 

Our  research  is  concerned  with  the  closely-interacting  attributes  of 
availability  and  consistency.  Our  objective  is  to  develop  techniques  for 

(i)  ensuring  the  availability  of  global  information  in  the  face  of  various  kinds 
of  failures,  and 

(ii)  ensuring  that  the  entities  that  access  the  information  get  a  consistent 
view of it, in spite of concurrency.

1.2.  Availability 

The  availability  of  a  system  is  usually  defined  as  the  probability  that  the 
system  will  be  functioning  normally  at  any  time  during  its  scheduled  working 
period [BAR 65]. We can extend this definition and say that the availability of
a piece of global information for a given operation (e.g., read, update) is the
probability that an entity authorized to perform that operation is able to do
so. In order to make this definition meaningful, we must include the proviso
that the operation is carried out in a manner that satisfies certain requirements,
e.g., the information obtained in a read operation should not be out-of-date
or corrupted as a result of prior failures.

The  difficulty  of  ensuring  the  availability  of  information  depends  on  the 
kind and degree of failure that must be prevented from making the information
unavailable. We distinguish two models of site failures: the crash model
and the malfunction model. (Lamport makes the same distinction in
[LAM 78b] using different terminology.)

In the crash model, if a site fails at time t,

(i) the site stops executing at t; any message it was in the process of sending
is not received, or if received, is recognized as mutilated.

(ii)  the  internal  state  and  contents  of  the  site’s  non-stable  storage  are  lost; 
the  contents  of  the  stable  storage  remain  intact. 

(iii)  while  in  the  crashed  state,  the  site  does  not  execute  any  operations. 

(iv) on recovery, the site knows that it has been in the crashed state and initiates
the appropriate recovery procedures.

(v)  the  site  executes  correctly  between  crashes. 

The fail-stop processors of [SCK 83] exhibit essentially the above characteristics
in their failure behavior. Consider a set of sites {R}, called
receivers, which wish to obtain a piece of information from another site T,
called the transmitter. If T crashes, it can result in a subset of {R} receiving
the information and the remaining sites in {R} not receiving it. However, a
crash in T cannot result in T sending out incorrect information to any
receiver in {R}; crashes are benign failures in this sense.

In the malfunction model, failures are more general and dangerous,
since failed sites can execute arbitrary state transformations and send arbitrary
messages. If T malfunctions, it may give out incorrect information to
sites in {R} and it may also give out different values to different sites for the
same  piece  of  information.  The  protocols  required  to  ensure  that  the 
receivers  reach  some  form  of  agreement  on  the  received  value  are  more 
complex in this case. A site may malfunction because it has been taken over
by a malicious agent or because of some undetected hardware or software

In order to show how the protocols for dealing with malfunctions must
differ from those for crashes, we consider the example of a transactional distributed
database. Assume that the crash model holds. From the viewpoint
of  preserving  the  availability  of  information  in  the  database,  the  following 
protocols  used  in  managing  it  are  relevant: 

(a)  read  and  update  protocols 

A number of protocols exist for performing the read and update operations.
The selection of the protocols to be used in a design will affect the
availability of the information for read and update operations. For example,
the read protocol may involve reading any single copy of the information and
the update protocol may involve updating all copies. Alternatively, we could
read x copies and update y copies where (x + y) is greater than the total
number of copies, using a timestamping mechanism to determine the most
up-to-date value if some of the x values read were not coincident. Yet
another alternative would be to direct all reads and updates to a single copy,
called the primary, which is responsible for distributing updates to all the
other copies. These protocols ensure different degrees of availability for
read and update operations, as well as other attributes such as response
time, communication costs, etc.
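
The second alternative can be illustrated by a minimal sketch, which is ours and not part of any of the cited designs. It assumes that crashes are modelled by an `up` flag on each copy and that update timestamps are supplied by the caller; the names `Copy`, `Unavailable`, `quorum_update` and `quorum_read` are hypothetical. Because x + y exceeds the number of copies, any x copies read must include at least one of the y copies last updated, and the largest timestamp identifies it.

```python
from dataclasses import dataclass
from typing import Any, List


class Unavailable(Exception):
    """Raised when too few copies are operational to complete the operation."""


@dataclass
class Copy:
    value: Any = None
    timestamp: int = 0  # timestamp of the last update installed at this copy
    up: bool = True     # False while the site holding the copy is crashed


def quorum_update(copies: List[Copy], y: int, value: Any, timestamp: int) -> None:
    """Install (value, timestamp) at y operational copies."""
    targets = [c for c in copies if c.up][:y]
    if len(targets) < y:
        raise Unavailable("fewer than y copies are operational")
    for c in targets:
        c.value, c.timestamp = value, timestamp


def quorum_read(copies: List[Copy], x: int) -> Any:
    """Read x operational copies and return the value bearing the largest timestamp."""
    operational = [c for c in copies if c.up]
    if len(operational) < x:
        raise Unavailable("fewer than x copies are operational")
    # Take the last x copies so that the read set differs from the update set.
    sample = operational[len(operational) - x:]
    return max(sample, key=lambda c: c.timestamp).value
```

With five copies, x = 2 and y = 4, a read survives three crashed sites while an update survives only one; choosing x = 1 and y = 5 gives the read-one, update-all protocol mentioned first.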

(b)  commit  protocols 

Commit protocols (such as the 2-phase commit protocol [GRA 78]) are
used to ensure the atomicity of transactions, i.e., either all the effects of the
transaction are installed (commit) into the global information structures
concerned or none are (abort). From the viewpoint of preserving availability,
the important question is whether or not the protocol is nonblocking
[SKE 81]. A commit protocol is said to be non-blocking for a class of failures

Consistency requirements are restrictions placed on the permissible
interleavings  of  actions  of  the  concurrent  transactions  and  thereby  on  what 
an  atomic  action  may  be  allowed  to  see.  In  the  database  example,  one 
requirement  that  is  usually  made  is  that  a  read  action  of  a  transaction  see 
the  effects  of  all  actions  of  another  transaction  or  see  the  effects  of  none  of 
them. In the example of the resource allocation table, a requirement may be
that  if  a  read  action  of  the  deadlock  detector  sees  the  effect  of  an  acquire 
action  by  a  transaction,  it  should  also  see  the  effects  of  any  resource  release 
executed  by  the  transaction  prior  to  the  acquire  operation.  Otherwise,  an 
inconsistent  picture  of  the  status  of  resources  and  transactions  may  be 
formed  by  the  deadlock  detector,  resulting  in  the  detection  of  false 
deadlocks. 

The  techniques  that  can  be  used  for  ensuring  consistency  in  distributed 
databases have been elegantly classified in [BER 81]; they fall into the two
broad categories of locking and timestamp ordering. The answer to the
question of which technique is preferable for distributed databases with
given requirements must await further research. [CAR 83] suggests, on the
basis of investigation of a restricted set of locking and timestamp-ordering
algorithms, that the former may be superior for single-site databases. On
the other hand, [GAL 82] and [LIN 83] find that timestamp-ordering is superior
for distributed databases, at least in some environments. Whatever the
answer may be for databases, it can be concluded from the available literature
that timestamp-ordering appears to provide a more versatile and
efficient mechanism for preserving the consistency of global information
used in the control of the distributed system. Examples of areas in which
timestamp-ordering has been used are database synchronization [BER 81],
network status maintenance [HAM 80], deadlock detection [TSA 82], dynamic
reconfiguration [MA 81], etc. In Chapters 2 and 4, we investigate the use of
clock  facilities  in  network  status  maintenance  and  deadlock  detection 
respectively. 

Some sort of clock mechanism has to be available from which the timestamps
can be derived. Such a facility should have the following characteristics:

(i)  it  should  be  distributed  for  reasons  of  reliability,  survivability  and 
efficiency  of  access. 

(ii)  the  facility  should  assign  time  values  which  reflect  the  ordering  of 
events  in  the  computer  system.  For  most  applications,  it  would  be 
sufficient  if  the  values  reflected  the  ordering  of  events  at  a  single  site 
and the ordering of events at different sites imposed by the flow of messages.

(iii)  the  value  of  the  clock  at  a  site  should  be  close  to  the  real-world  time. 
This  in  turn  implies  that  clock  values  at  different  sites  should  not  drift 
appreciably  from  one  another. 

(iv)  Maintenance  of  the  clock  facility  should  be  inexpensive. 

[LAM  78a]  has  proposed  distributed  clock  mechanisms  synchronized  by 
messages  which  satisfy  properties  (i)  and  (ii).  The  mechanisms  require  the 
clock  at  a  site  to  be  advanced  when  a  message  arrives  bearing  a  timestamp 
value greater than the local clock value. Hence, all the clocks have a tendency
to catch up with the fastest one among them. This in turn may cause
them  to  drift  ahead  of  the  real-world  time.  However,  turning  them  back  may 
vitiate  the  required  ordering  of  events.  [BEL  79]  suggests  a  slowing  down  of 
clocks,  when  too  large  a  drift  from  the  real-world  time  is  noticed.  This  would 
provide property (iii). Much of the functionality required of the facility could
be implemented by hardware or microcode and this could help to satisfy the
efficiency requirement mentioned in (iv) above. [HAM 80] proposes techniques
for including site crashes among the events that the clock mechanism
imposes  an  order  on.  It  relies  upon  the  use  of  probe  messages  by  means  of 
which  a  site  in  the  network  can  ascertain  the  health  of  any  other  site  in  the 
network.  The  defects  of  this  approach  and  an  alternative  solution  are 
covered in Chapter 2. The problem of synchronizing clocks in the malfunction
model is covered in [LAM 81b].

1.4.  Providing  Facilities  for  Availability  and  Consistency 

There is a paucity of systems that have implemented any but the simplest
options for providing availability and consistency. SDD-1 [HAM 80] is one
example of experimentation with a novel distributed operating system (the
RelNet), which has attempted to provide timestamp-based lower-level
mechanisms on the basis of which the availability and consistency requirements
can be fulfilled. But more experience with similar systems, which
select  from  the  various  options  for  providing  availability  and  consistency  to 
achieve  viable  combinations,  is  required. 

1.5.  Scope  of  Report 

As  mentioned  above,  it  is  our  belief  that  timestamp-ordering  based  on  a 
global  clock  mechanism  provides  the  most  versatile  basis  that  can  satisfy 
the consistency requirements of both application and network control functions.
In the absence of failures, such a clock mechanism is simple to build
[LAM 78a]. However, much of the difficulty in controlling distributed systems
lies  in  the  problem  of  providing  failure-tolerance,  which  is  closely  intertwined 
with the consistency problem. As will be seen in Chapter 2, it is more
difficult to construct a clock mechanism that assigns times to failure-related
events  in  such  a  way  as  to  make  it  useful  for  satisfying  consistency  and 
failure-tolerance  requirements.  In  Chapter  2,  we  provide  a  design  for  such  a 
clock  mechanism  for  the  crash  model.  The  design  gives  each  site  a  view  of 
the status of all the sites in the network at every instant of time. The network
status view is used to construct a solution to the problem of updating a
replicated file. This solution involves keeping some of the replicas continuously
up-to-date, while the others are updated periodically. When sites holding
the up-to-date replicas crash, they have to be replaced. The synchronization
requirements of this solution provide a good test of the capabilities of
the  clock  mechanism. 

In  Chapter  3,  we  address  the  problem  of  preventing  error  propagation  in 
global information due to malfunctions. A more general form of the Byzantine
Generals Agreement is formulated and methods for adapting it to provide
different degrees of malfunction-tolerance, according to the criticality
of  the  global  information,  are  developed. 

Chapter 4 deals with deadlock detection in distributed database systems.
The race conditions that render most of the algorithms in the literature
incorrect are discussed, and algorithms making use of a clock facility
are proposed. This feature of the algorithms helps in showing that all
genuine deadlocks are detected and no spurious indications of deadlock are
given.


CHAPTER  2 


DESIGN  AND  USE  OF  A  NETWORK  STATUS  MAINTENANCE  SCHEME 

2.1.  Introduction 

In  this  chapter  we  propose  a  technique  for  maintaining  information 
about  the  operational  status  of  the  sites  in  a  point-to-point  network,  e.g.,  the 
Arpanet.  We  show  how  this  technique  can  provide  the  basis  for  the  solution  of 
a  control  problem  concerned  with  file  updating  in  such  a  network. 

In  the  network,  as  different  sites  go  out  of  operation  and  recover,  the 
allocation of functions and tasks must be changed in accordance with network
status in order to preserve the services the network provides. For this
purpose,  a  view  of  the  status  of  the  various  sites  must  be  obtained  and 
updated  as  time  proceeds. 

In  order  to  co-ordinate  these  functions  and  tasks  as  well  as  to  ensure 
the  consistency  of  the  view  of  system  status,  a  synchronization  mechanism  is 
necessary.  The  mechanism  used  in  our  method  is  a  global  clock  facility. 

In  Section  2.2,  we  discuss  the  issues  concerned  in  status  maintenance 
and the proposed method. In Section 2.3, we show how a reconfiguration control
problem in file updating can be solved using this method.

2.2.  Network  Status  Maintenance 
2.2.1.  Overview 

Section  2.2.2  discusses  the  requirements  placed  on  the  global  clock 
facility in order that it may serve as a synchronization mechanism for our
status maintenance scheme, along with previous work in designing such a
facility. Section 2.2.3 describes previous work in status maintenance. Section
2.2.4 develops the proposed method.

2.2.2.  Requirements  for  the  Global  Clock  Facility 

In constructing a global clock facility, we must ensure that it is consistent
with the notion of causality. If an event X causally affects another
event  Y,  the  global  clock  should  assign  a  greater  time  to  Y  than  to  X. 

Consider a failure-free distributed system. An event can causally
influence other events occurring after it at the same site, e.g., a write operation
on a piece of data will influence the result of a subsequent read operation.
Again, when a message is sent from one site to another, an event occurring
before the sending of the message at the first site can influence events
occurring  at  the  second  site  after  the  receipt  of  the  message. 

In order to achieve reliability and for efficient accessibility, it is desirable
to construct a global clock out of several local clocks, one at each site.
Events at a given site are assigned times using the current value of the local
clock. In order to ensure that these assignments satisfy the causal relationships
among events arising in the two ways mentioned above, Lamport
[LAM  78a]  proposed  two  rules  which  each  local  clock  should  obey: 

C1. At each site i, the local clock C(i) is incremented between any
two successive events.

C2. If event a is the sending of a message m by site i, then the message
contains a timestamp, the time assigned by C(i) to a. Upon
receiving the message, site j sets C(j) to a value more than the maximum
of its current value and the timestamp. The receipt of m is
supposed to occur after the setting of C(j).

If these rules are followed, the causal relationships between events will
be  reflected  by  the  times  assigned  to  them. 
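
A minimal sketch of a local clock obeying the two rules is given below. It is only an illustration: the class name `LocalClock` and its methods are hypothetical, clock values are taken to be integers, and "a value more than the maximum" is realised by adding one.

```python
class LocalClock:
    """One local clock C(i) of the global clock facility (illustrative sketch)."""

    def __init__(self) -> None:
        self.value = 0

    def tick(self) -> int:
        """Rule C1: advance the clock between two successive local events."""
        self.value += 1
        return self.value

    def send(self) -> int:
        """Rule C2, sender side: the timestamp is the time assigned to the send event."""
        return self.tick()

    def receive(self, timestamp: int) -> int:
        """Rule C2, receiver side: move past both the current value and the timestamp.

        The receipt of the message is taken to occur at the returned time.
        """
        self.value = max(self.value, timestamp) + 1
        return self.value
```

If site i sends at time 3 while the clock of site j still reads 0, the receipt is assigned time 4, so the send event is ordered before the receipt as causality requires.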

Next  consider  the  case  where  site  failures  are  among  the  events  to  be 
considered.  If  a  site  X  fails,  the  time  at  which  another  site  Y  detects  the 
failure  and  marks  X  as  DOWN  should  be  greater  than  the  time  on  X’s  clock 
when it failed. Similarly the time on Y's clock when it marks X UP on its
recovery should be greater than the value on X's clock when it recovers;
however  there  are  some  additional  considerations  relevant  here  which  we 
discuss  below. 

Consider  a  distributed  system  of  two  sites,  X  and  Y.  Assume  that  both 
sites  are  operational  and  that  Y  wants  to  perform  a  read  operation,  to  which 
it  has  assigned  the  time  T,  on  a  local  file.  Further  assume  that  the  result  of 
the  read  operation  at  T  should  reflect  the  effect  of  all  updates  to  the  file 
assigned  times  prior  to  T.  (Note  that  this  requirement  is  not  implied  by 
causality:  while  all  updates  which  do  influence  the  read  operation  are 
required  by  causality  to  have  earlier  times,  not  all  updates  with  times  less 
than  T  are  required  to  make  their  influence  felt  when  the  read  operation  is 
performed, i.e., they may be performed after the read.) To satisfy these
requirements, Y waits till it receives a message from X timestamped greater
than T (if desired, it could send a message timestamped T with a request for
acknowledgement). Assuming that messages are delivered from one site to
another in the order sent, and that a site sends its messages in timestamp
order, Y now knows that it has received all update messages originating
from X which have update times less than T. It can perform all such updates
(local and from X) and then perform the read operation.
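
The waiting rule just described can be sketched as follows. This is an illustration under the stated assumptions (in-order delivery, each site sending in timestamp order), with an update simply overwriting the file contents; the class `ReplicaSite` and its method names are hypothetical.

```python
from typing import Any, Dict, List, Optional, Tuple


class ReplicaSite:
    """Site Y holding a local file that is updated by Y itself and by its peers."""

    def __init__(self, peers: List[str]) -> None:
        self.latest: Dict[str, int] = {p: 0 for p in peers}  # highest timestamp seen per peer
        self.pending: List[Tuple[int, Any]] = []              # buffered (timestamp, new value)
        self.value: Any = None

    def on_message(self, peer: str, timestamp: int, update: Optional[Any] = None) -> None:
        """Record any message from a peer; buffer it if it carries an update."""
        self.latest[peer] = timestamp
        if update is not None:
            self.pending.append((timestamp, update))

    def local_update(self, timestamp: int, update: Any) -> None:
        self.pending.append((timestamp, update))

    def try_read(self, t: int) -> Tuple[bool, Any]:
        """Perform the read assigned time t if every peer has been heard from past t."""
        if any(seen <= t for seen in self.latest.values()):
            return False, None  # must keep waiting for a message timestamped > t
        due = sorted((u for u in self.pending if u[0] < t), key=lambda u: u[0])
        for _, update in due:   # apply all earlier updates in timestamp order
            self.value = update
        self.pending = [u for u in self.pending if u[0] >= t]
        return True, self.value
```

A read at T = 8 is refused while the last message from X bears timestamp 5, and succeeds once any message from X timestamped 9 has arrived.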

Now assume instead that X failed some time prior to T and is known to
have  done  so  by  Y  when  it  performs  the  read  operation.  If  X  recovers  after 
the  read  operation  and  issues  an  update  timed  less  than  T,  the  results  of  the 
read  operation  will  not  fulfill  the  specified  requirement.  For  this  reason  it  is 
desirable  that  on  recovering,  X  should  set  its  clock  to  a  value  greater  than 
any  at  which  Y  has  it  marked  as  DOWN.  Earlier  we  saw  that  in  order  to 
satisfy causality, the time at which X recovers should be less than that at
which it is marked UP at Y. Now we see that X should recover with a clock
setting greater than Y's clock value when it marked X UP. These two
requirements can be reconciled by assuming that X pauses, i.e., does only
null operations till its clock value exceeds that at which Y marked it UP (Fig.
2.1).

Summarizing, we see that our global clock facility should obey rules C1
and C2 and a third rule:

C3. If a site i is marked DOWN at time t at another site j, then site i
should not be operational at that time t.
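
Rule C3 and the pause that reconciles the two recovery requirements can be checked on a small numeric example. The sketch below is illustrative only: it assumes integer clock values and half-open intervals [start, end), and the function names `violates_c3` and `resume_time` are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open interval [start, end) of global clock values


def violates_c3(down_marks: List[Interval], operational: List[Interval]) -> bool:
    """True if site i is operational at some time at which another site has it marked DOWN."""
    return any(
        down_start < op_end and op_start < down_end
        for down_start, down_end in down_marks
        for op_start, op_end in operational
    )


def resume_time(marked_up_at: int) -> int:
    """Earliest clock value at which the paused site may resume real operations.

    Assumes integer clocks, so 'exceeds the mark-UP time' means one tick later.
    """
    return marked_up_at + 1
```

Suppose X fails at time 8, and Y has X marked DOWN from time 10 until it marks X UP at time 20. If X resumed real operations at time 15, `violates_c3([(10, 20)], [(0, 8), (15, 40)])` would report a violation; pausing until `resume_time(20)` removes it.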

2.2.3.  Previous  Work  in  Status  Maintenance 

Kuhl  and  Reddy  [KUH  80]  propose  a  scheme  in  which  a  site  is  tested  by  a 
subset  of  its  immediate  neighbors  who  pass  the  test  results  to  the  rest  of  the 
network.  Only  the  test  results  sent  by  those  sites  which  themselves  have 
been  found  to  be  operating  correctly  are  relayed  through  the  network.  The 
deficiencies  of  this  scheme  are: 

(i)  there  is  no  notion  of  time  attached  to  the  test  results  so  that  it  is  difficult 
to  integrate  the  test  results  from  different  sites  in  a  consistent  manner  and 
to determine for what period they are valid.

(ii) the question of link failures, which may cause correctly operating sites to
arrive at different conclusions concerning the status of a common neighbor,
is not considered.

FIG. 2.1. EXAMPLE TO ILLUSTRATE RULE C3

The  SDD-1  RelNet  [HAM  80]  performs  site  status  maintenance  using  a 
global clock facility which achieves the requirements described in the previous
section. In this scheme, any site X in the network directly determines
the status of any other site Y in the network by trying to communicate with
it. If no response is obtained within a certain time, X marks Y as DOWN in
its local status table. A YOU_ARE_DOWN message is sent to Y in case the
lack of response was due to some cause other than a failure of Y. Receipt of
this message causes Y to cease operation in order to comply with rule C3,
and then to execute a recovery procedure. To ensure that Y actually gets
the YOU_ARE_DOWN message, it is deposited with another site called a guardian
of Y, with whom Y periodically checks for such messages.

The  defect  of  this  status  maintenance  scheme  is  that  sites  may  often  be 
made  to  cease  operation  needlessly.  If  the  network  becomes  congested  at 
some  spots,  timers  will  begin  to  run  out  and  sites  will  become  busy,  stopping 
operation  themselves  and  recovering,  thus  aggravating  the  problem.  A  site 
may  be  too  busy  to  reply  in  time  to  all  the  messages  that  it  may  receive  from 
various  parts  of  the  network,  but  it  may  well  be  able  to  sustain  a  low-level 
protocol  with  its  immediate  neighbors  to  assure  them  that  it  has  not  failed. 
Other reasons for a site not responding in time could include being in a critical
section, in a recovery procedure, in a high-priority task, etc. To force the
site to cease operation in such situations is evidently not desirable. In the
RelNet, it is possible that two or more sites trying to recover at the same
time will force each other to stop operation repeatedly unless such a situation
is detected and a random wait period observed before trying to recover
again.  The  scheme  of  Kuhl  and  Reddy  has  distinct  advantages  in  that  the 
failures  detected  and  communicated  are  much  less  likely  to  be  spurious. 

Our  method  overcomes  the  deficiencies  of  both  the  above  schemes.  In 
addition, it has a limited ability to deal with network partitions, which neither
of the above schemes has.

2.2.4. Proposed Scheme

2.2.4.1. Overview

In  the  proposed  scheme,  every  site  periodically  broadcasts  the  state  of 
each  communication  link  attached  to  it  to  the  whole  network.  The  state  of 
the  communication  link  may  be  broadcast  as  down  either  because  the  link 
itself  has  failed  or  because  the  site  at  its  other  end  has  failed.  The  state  of  a 
given  site  is  determined  by  other  sites  in  the  network  on  the  basis  of  the 
states  of  all  the  links  attached  to  that  site.  This  requires  putting  together 
reports  from  different  sites,  in  a  consistent  manner.  This  is  done  with  the 
help  of  a  global  clock  facility  which  fulfills  the  requirements  stated  in  Section 
2.2.2. 

Consider a network N composed of two parts N1 and N2 connected by
the set of links L (Fig. 2.2). Suppose the sites in the part N2 which are connected
to the links L observe that the links have failed. This information is
circulated among the sites in N2. Suppose further, that the sites in N1 form
a minority (usually just one site). Then the sites in N2 will mark the site(s) in
N1 as DOWN on the basis of this information.

It may be that the sites in N1 have actually failed, causing the links L to
appear to have failed to the sites in N2. On the other hand, it may be the
links that have failed. Further, not all the links may have failed simultaneously,
but the news of the recovery of some of them may not have reached all
the sites in N2 when they made the decision to mark the sites in N1 DOWN.
In the latter two cases, the sites in N1 connected to the links L detect the
failures of the links and circulate the information. By this means, the sites
in N1 realize the possibility of their being marked DOWN and hence cease
operation in time to comply with rule C3.
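
The decision each site makes from the circulated link states can be sketched as follows. This is a simplified illustration, not the full scheme: it ignores the timing of the reports, assumes every site sees the same set of down links, and treats any part that is not a strict majority as having to cease operation. The function names `reachable` and `status_view` are hypothetical.

```python
from collections import deque
from typing import Iterable, List, Set, Tuple

Link = Tuple[str, str]


def reachable(site: str, sites: List[str], links: Iterable[Link],
              down_links: Iterable[Link]) -> Set[str]:
    """Sites reachable from `site` over links not reported down (breadth-first search)."""
    up = {frozenset(l) for l in links} - {frozenset(l) for l in down_links}
    seen, queue = {site}, deque([site])
    while queue:
        cur = queue.popleft()
        for other in sites:
            if other not in seen and frozenset((cur, other)) in up:
                seen.add(other)
                queue.append(other)
    return seen


def status_view(site: str, sites: List[str], links: Iterable[Link],
                down_links: Iterable[Link]) -> Tuple[bool, Set[str]]:
    """Return (continue_operating, sites_marked_down) for `site`."""
    part = reachable(site, sites, links, down_links)
    if 2 * len(part) > len(sites):   # this site lies in the majority part
        return True, set(sites) - part
    return False, set()              # possibly marked DOWN elsewhere: cease operation (rule C3)
```

With sites A, B, C, D and the single link between A and B reported down, B continues and marks A DOWN, while A, finding itself in a minority, ceases operation.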

The sites in N1 execute their recovery procedure as follows. First the
sites that are directly connected with operational sites in N2 complete their
recovery and enter normal operation. Then their neighbors, who had no
direct connection with an operational site till then, are able to start and
complete their recovery and enter normal operation. This process continues
till all sites in N1 recover.

The recovery of any given site i is performed by first informing the network
that its links are functioning and that it itself is about to resume normal
operation. In broadcasting this information, site i should not have to
wait for failed sites to recover and acknowledge that they have received this
information. One possible way out of this difficulty would be to look at the
circulating information concerning link failures in the network and use it to
mark sites DOWN as above. Then site i need only wait for acknowledgement
messages from sites not marked DOWN stating that they know about its
impending transition to normal operation. The sites thus marked DOWN are
assumed to mark every site UP when they initialize themselves on recovery.
However, this method is a double-edged sword for site i. Other broadcast
messages of site and link recoveries occurring at the same time may not
reach site i, since the broadcasters of these messages, who may have site i
marked as DOWN, would not have to ensure that the broadcast messages
reach site i. Hence site i may form an incorrect picture of which sites are
DOWN and thus omit to inform sites that are in normal operation of its
transition to normal operation, leading to a violation of rule C3. Our solution
to this problem is for each site i, in its recovery procedure, to have one or
more sites in normal operation serve as guards in ensuring that broadcasts
reach site i. This is in contrast to the RelNet technique. There, if a site i
concludes that another site j is DOWN, site i has a YOU_ARE_DOWN message
sent to site j. This ensures that the latter, if it has not really failed, ceases
operation in time to validate site i's assumption.

2.2.4.2. Assumptions

The  following  assumptions  are  made: 

(i)  The  network  has  a  fixed  topology  with  links  connecting  pairs  of  sites  as  in 
the  Arpanet  and  each  site  knows  this  topology.  This  assumption,  as  well  as 
the  use  of  a  network-wide  broadcast  facility  in  our  scheme,  limits  the  size  of 
the  network  to  which  it  can  be  efficiently  applied.  For  large  networks,  a 
hierarchical  scheme  will  have  to  be  used. 

(ii) If partitions occur, they occur in such a manner as to leave a majority of
sites  connected.  This  assumption  is  required  because  our  method  handles 
partitions  as  follows. 

When the network gets partitioned, the partition(s) that have a minority
of sites cease operation. This is done because the sites in the majority
partition (if one exists) will mark the sites in the minority partitions as
DOWN. Hence, in order to comply with rule C3, the sites in the minority
partitions must cease operation until the partition is repaired, while the sites
in the majority partition continue in operation.
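The majority rule above can be illustrated with a small sketch. It is not part of the scheme as specified: the function names and the representation of the network as a list of working bilinks are assumptions made only for illustration, and the sketch takes a global view that no single site actually has.

```python
from collections import deque

def connected_groups(sites, working_links):
    """Group sites that can still reach each other over working bilinks."""
    neighbors = {s: set() for s in sites}
    for a, b in working_links:
        neighbors[a].add(b)
        neighbors[b].add(a)
    seen, groups = set(), []
    for start in sites:
        if start in seen:
            continue
        group, queue = {start}, deque([start])
        while queue:
            u = queue.popleft()
            for v in neighbors[u]:
                if v not in group:
                    group.add(v)
                    queue.append(v)
        seen |= group
        groups.append(group)
    return groups

def sites_allowed_to_operate(sites, working_links):
    """Sites in a strict-majority partition continue; all others must cease."""
    for group in connected_groups(sites, working_links):
        if 2 * len(group) > len(sites):
            return group
    return set()  # no majority partition exists: every site ceases operation
```

With five sites, a partition into {1,2,3} and {4,5} leaves {1,2,3} running, while a split into three minority pieces stops every site, which is the catastrophe described in the next paragraph.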

Thus  if  the  network  partitions  into  more  than  two  pieces  with  each  piece 
only  having  a  minority  of  sites,  all  sites  cease  operation.  Even  when  the  par¬ 
tition  is  repaired  it  is  difficult  for  the  network  to  bring  itself  up  automatically 
after this event for the following reason. If a majority of sites is always
operational, they enable a failed site (or group of sites), on recovery, to
make the necessary deductions about the clock values with which sites that
are still in failed states went down. The recovering sites are then able to set
their  local  clocks  to  values  which  ensure  compliance  with  rule  C3  and  then 
resume  normal  operation.  But  if  all  sites  cease  operation  at  some  time,  it  is 
difficult  for  any  of  them,  when  the  partitions  are  repaired,  to  recover  and  set 
their  clocks  to  such  values  until  all  sites  have  recovered  and  their  clock 
values  have  been  ascertained.  Our  method  does  not  handle  the  problem  of  a 
network  in  which  all  sites  have  ceased  operation,  and  will  have  to  be 
extended  to  deal  with  such  a  situation. 

(iii)  For  similar  reasons,  we  assume  that  the  number  of  sites  that  have  failed 
is  small  enough  to  leave  at  least  a  majority  of  sites  which  are  connected,  in 
operation.  Otherwise  the  same  catastrophe,  namely,  of  all  the  sites  ceasing 
operation,  will  occur. 

(iv) Sites are assumed to stop when they fail, i.e. they do not fail in such a
way  as  to  execute  their  algorithms  incorrectly  or  exhibit  malicious  behavior. 
Thus  it  is  the  crash  model  of  Chapter  1  that  we  are  assuming. 

2.2.4.3. Site and Link States

A site or link is simply marked as UP or DOWN by every site in the network
in the data structures that it maintains to record its view of the system
state. However, in addition, a site itself maintains more detailed state
information concerning itself as it goes through the various stages of
recovery and normal operation. Similarly, the sites attached to a link also
maintain more detailed state information concerning the link. In this section
we summarize this detailed state information. The data structures that are
used to record system state views are discussed in Section 2.2.4.8.

Fig. 2.3 shows the states and state transitions that a site may go through
as recorded in the site itself. The site enters the crashed state when it
actually crashes because of a hardware or software fault or when it suspects
that some other site may consider it DOWN (i.e. it crashes itself). Sync and
pause are recovery states. In the sync state, the site synchronizes its logical
clock with the clocks of neighboring sites which are in the normal operational
state. In the pause state, the site informs other sites in the network
that its links are functioning properly and that it is about to enter the
operational state. Only when the site enters the operational state does the
higher-level software (e.g. the file-updating software described in Section 2.3)
resume execution. The site may reenter the crashed state at any instant for
either of the two reasons given above.
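As a rough illustration of the site states just described, the sketch below encodes them as a small state machine. Fig. 2.3 is not reproduced in this text, so the linear recovery order crashed, sync, pause, operational is read off the prose, and the event names "crash" and "step_done" are hypothetical labels, not messages of the scheme.

```python
from enum import Enum

class SiteState(Enum):
    CRASHED = "crashed"
    SYNC = "sync"
    PAUSE = "pause"
    OPERATIONAL = "operational"

# Assumed recovery order, taken from the prose description of Fig. 2.3.
_RECOVERY_NEXT = {
    SiteState.CRASHED: SiteState.SYNC,
    SiteState.SYNC: SiteState.PAUSE,
    SiteState.PAUSE: SiteState.OPERATIONAL,
}

def next_state(state, event):
    """Return the site state after a (hypothetical) event."""
    if event == "crash":
        # A real crash, or the site crashing itself, is possible at any instant.
        return SiteState.CRASHED
    if event == "step_done":
        # The current recovery stage has completed; operational stays put.
        return _RECOVERY_NEXT.get(state, state)
    raise ValueError(f"unknown event: {event}")

def higher_level_software_runs(state):
    """Higher-level software executes only in the operational state."""
    return state is SiteState.OPERATIONAL
```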

Fig. 2.4 shows the states and state transitions for links. Consider a link
connecting two adjacent sites i and j. Although this is in reality one
bidirectional link, the two sites maintain their view of the state of this link
in the form of the state of the unidirectional links (i,j) and (j,i) respectively.
In the sequel, we will refer to the actual bidirectional link as a bilink, whereas
the unidirectional links whose states are recorded in the detailed state
information alluded to above and in the data structures described in Section
2.2.4.8 are referred to as unilinks. When the bilink itself fails (e.g. due to
hardware problems or noise) or when one of the sites connected to it
crashes, the state of the corresponding unilinks is set to broken at both or
one of the sites depending on which of the above situations exists.
Ok_unsync is a recovery state, in which the clocks at the two ends of the
unilink have not been synchronized, but the link is physically in usable
condition. Ok_sync is the normal operational state. The link may enter the
broken state at any moment for either of the reasons given above.

2.2.4.4. The Clock Synchrony Rule

Let C(i) denote the local logical clock (there is also a local real-time
clock to be discussed in Section 2.2.4.5.) at site i, and N(i) denote the set of
immediately neighboring sites of i. For each site k in N(i), site i maintains a
register LTR(k,i) which contains the largest timestamp attached to a message
received by site i from site k over the bilink between them. The following
relation is always maintained:

CSR:  C(i) < A + min { LTR(k,i) : k in N(i) and st(i,k) = ok_sync }
where st(i,k) is the detailed state of unilink (i,k).

This implies that, if at any instant two adjacent sites i and j record the
unilinks (i,j) and (j,i) respectively as in ok_sync state, then at that instant:

|C(i) - C(j)| < A

The local clock C(i) at a site i may have to be advanced for several reasons.
It is incremented by 1 for generating a new timestamp, and when a
message arrives with a timestamp greater than the current value of C(i),
C(i) must be increased to a value beyond the timestamp. These advances
of C(i) are required to implement the rules C1 and C2 described in
Section 2.2.2 to satisfy causality. A clock advance may be necessary also
when other events occur, e.g. the timer which keeps C(i) in rough synchrony
with a real-time clock runs out, a recovering site needs to bump its clock
ahead to synchronize with its neighbors, etc.

If the advance cannot be made in compliance with CSR with the current
values of the LTR registers, the site i sends a timestamped
REQ_TIME_SIGNAL message to the appropriate neighbors depending on the
values in the LTR registers, the current clock value and the value to which it
has to be bumped. The neighbors will each reply with acknowledgements
bearing timestamps greater than the one attached to the
REQ_TIME_SIGNAL. The acknowledgements will increase the values in the
LTR registers, permitting C(i) to be advanced. Note that C(i) cannot be
increased by more than A at a step, so that greater increases have to be
performed in multiple steps.
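The following sketch shows how a site might test a proposed clock advance against CSR. It assumes integer clock values, represents the LTR registers and unilink states as dictionaries keyed by neighbor, and uses the hypothetical names csr_limit and try_advance. Advancing as far as CSR allows when the full advance is blocked is an assumption of the sketch, not something the text prescribes.

```python
def csr_limit(ltr, link_state, delta):
    """Largest integer clock value allowed by CSR, or None if unconstrained.

    CSR requires C(i) < delta + min LTR(k,i) over neighbors k whose
    unilink (i,k) is in ok_sync state; delta plays the role of A.
    """
    synced = [ltr[k] for k in ltr if link_state[k] == "ok_sync"]
    return delta + min(synced) - 1 if synced else None

def try_advance(clock, target, ltr, link_state, delta):
    """Return (new_clock, neighbors that need a REQ_TIME_SIGNAL)."""
    limit = csr_limit(ltr, link_state, delta)
    if limit is None or target <= limit:
        return max(clock, target), []
    # Neighbors whose LTR value is too small to permit the full advance.
    blockers = sorted(
        k for k in ltr
        if link_state[k] == "ok_sync" and delta + ltr[k] <= target
    )
    return max(clock, limit), blockers
```

For example, with A = 10 and LTR values 100 and 90 on two ok_sync unilinks, the clock may reach at most 99; a request to advance to 105 stops at 99 and names the neighbor with LTR 90 as the one to be signalled.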

2.2.4.5. Synchronizing with Real-Time Clocks

It  is  desirable  to  keep  each  logical  clock  C(i)  in  rough  synchrony  with  a 
local  real-time  clock.  This  is  necessary  so  that  the  local  clocks  do  not  drift 
apart  to  the  degree  allowed  by  CSR  (two  sites  can  be  as  far  apart  in  their 
clock  readings  as  the  number  of  hops  in  the  shortest  path  between  them 
multiplied by A). Otherwise frequent REQ_TIME_SIGNAL messages will have
to  be  sent  in  order  to  receive  messages  in  accordance  with  rule  C2  and  at 
the  same  time  maintain  CSR.  The  method  of  ensuring  this  rough  synchrony 
depends  on  how  well  the  real-time  clocks  at  different  sites  are  themselves 
synchronized  with  respect  to  each  other.  We  consider  two  cases: 

(i) Close Synchrony: Here the real-time clocks develop differences of the
order of A only over very long periods of time. In this case, the SDD-1 RelNet
method can be used [HAM 80]. Once after every interval of duration tc
(say 1 second) on the real-time clock, its reading in seconds multiplied
by a large number N (say 10^8) is compared with the logical clock and, if
the reading of the logical clock is less, it is set to the multiplied value.
Otherwise no action is taken. Usually, however, the increment in the logical
clock during the period tc is much less than N*tc and the setting does
occur. When the real-time clocks develop differences of the order of an
appreciable fraction of A, they should be resynchronized in some manner.

(ii) Loose Synchrony: Here the real-time clocks may develop differences of
the order of A comparatively quickly. In this case, every tc interval the
reading of the logical clock C(i) is stored away in a location RC(i). When
the next such interval elapses, the logical clock reading is compared with
RC(i)+N*tc and, if less, is replaced by the latter. The new value of C(i) is
stored in RC(i). In this way, a steady increase of C(i) with respect to
real-time is obtained even if the real-time clocks are only in loose
synchrony. Use of the previous technique could result in sudden disruptive
jumps in C(i) when messages from other sites arrive, if the real-time
clocks develop large differences.

Increments in C(i) arising from the synchronization with the real-time clock
can be anticipated and REQ_TIME_SIGNAL messages sent out in advance, as
in the RelNet [HAM 80], so that incrementation does not get held up because
C(i) cannot be advanced in consonance with CSR using the current LTR
values.
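The two cases can be contrasted in a short sketch. The function names are hypothetical, n and tc stand for N and tc, and the sketch ignores CSR, so in a real site each proposed advance would still have to be checked against the LTR registers first.

```python
def close_sync_tick(logical_clock, real_time_seconds, n):
    """Case (i): raise the logical clock to N times the real-time reading
    if it has fallen behind; otherwise leave it alone."""
    return max(logical_clock, n * real_time_seconds)

def loose_sync_tick(logical_clock, rc, n, tc):
    """Case (ii): force at least N*tc of progress since the reading RC(i)
    saved at the previous interval. Returns (new C(i), new RC(i))."""
    new_clock = max(logical_clock, rc + n * tc)
    return new_clock, new_clock
```

In case (ii) the clock is compared only with its own earlier reading, so a disagreement between the real-time clocks of different sites cannot by itself cause a large jump.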

2.2.4.6. The Link Monitor Module

The basic function of this module is to probe each unilink periodically,
and to warn other interested parties when the unilink appears to go dead and
when it recovers.

Each site executes this protocol module on each of its unilinks to
immediate neighbors. If a neighbor does not respond in timely fashion to the
messages of this protocol, that neighbor runs the risk of having the unilinks
to it marked DOWN, and then the site itself may be marked DOWN. This may
cause the site to have to cease operation in order to comply with rule C3.

The  messages  of  this  protocol  are  not  timestamped  since  it  is  required 
to  execute  when  the  clock  has  not  been  synchronized  with  the  clocks  of 
neighbors  during  recovery.  Again,  it  may  be  that  the  site  is  waiting  for  an 
increment  in  its  LTR  registers  in  order  to  increase  C(i).  The  protocol  is 
required  to  be  sending  messages  during  this  waiting  period,  too.  Therefore, 
the  protocol  must  have  some  independent  sequencing  mechanism  to  corre¬ 
late  messages  sent  with  their  responses. 

The protocol requires every site i to periodically test each unilink
directed from site i by sending a REQUEST message to which an ACK
response is expected from the receiving site at the other end of the unilink
within some time-out period. If the ACK is not received before the timer
runs out, site i sets the unilink to broken if it is not already in that state.
The probing of the unilink is continued when the unilink is in broken state.
When the site i next receives an ACK to a REQUEST, it sends out a LINKDOWN
message to ensure that the neighbor realizes that the unilink from
site i to it was in broken state. To this message a LINKDOWN_ACK response
is expected. If it arrives in the time-out period, the unilink (i,j) is set to
ok_unsync state. Symmetrically, if a LINKDOWN message arrives at site i, it
sets the unilink (i,j) to broken state, sends a LINKDOWN message if it has
not already sent one to which a response is pending, and then sends a
LINKDOWN_ACK response. The probing of the unilink goes on in the
ok_unsync state, also.

If a message not related to this protocol arrives over a unilink (j,i) to
site i while it has the unilink (i,j) marked as broken, then the message is
suppressed. Also site i itself is prevented from sending a message over the
unilink (i,j) while it is in broken state.

The link monitoring module provides a facility by which an interrupt is
generated to any process in site i which has requested to be informed when a
unilink (i,j) goes into broken state or comes back to ok_unsync state. If the
unilink is already in the state specified, an immediate return is provided. Fig.
2.5 shows the unilink state transitions in which the link monitor module is
involved.
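The handshake described in this section can be sketched as an event-driven class for one unilink (i,j) at site i. The class name, the method names and the convention of returning the list of messages to send are all hypothetical. Fig. 2.5 is not reproduced here, so the behaviour on repeated ACKs while a LINKDOWN is pending is an assumption, and the transition from ok_unsync to ok_sync is omitted because it belongs to the recovery procedures rather than to this module.

```python
class UnilinkMonitor:
    """View held by site i of the unilink (i,j)."""

    def __init__(self, state="ok_sync"):
        self.state = state
        self.linkdown_pending = False  # LINKDOWN sent, LINKDOWN_ACK awaited

    def on_ack_timeout(self):
        """No ACK to a REQUEST arrived before the timer ran out."""
        self.state = "broken"
        return []

    def on_ack(self):
        """An ACK to a REQUEST arrived."""
        if self.state == "broken" and not self.linkdown_pending:
            self.linkdown_pending = True
            return ["LINKDOWN"]
        return []

    def on_linkdown_ack(self):
        """LINKDOWN_ACK arrived within the time-out period."""
        if self.state == "broken" and self.linkdown_pending:
            self.linkdown_pending = False
            self.state = "ok_unsync"
        return []

    def on_linkdown_ack_timeout(self):
        """LINKDOWN_ACK did not arrive in time; the unilink stays broken."""
        self.linkdown_pending = False
        return []

    def on_linkdown(self):
        """The neighbor reports that it saw the link as broken."""
        self.state = "broken"
        out = []
        if not self.linkdown_pending:
            self.linkdown_pending = True
            out.append("LINKDOWN")
        out.append("LINKDOWN_ACK")
        return out
```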

2.2.4.7. The Link State Reporter Module

The  function  of  this  module  at  a  given  site  is  to  watch  the  state  of  the 
unilinks  directed  from  the  site  and  to  broadcast  the  state  to  the  network. 

Every t i ticks of the local logical clock, at a site i in the operational or
the pause state, this module broadcasts the state of all the unilinks directed
from it with a timestamped message.

In addition, when a unilink is discovered to have gone from ok_sync state
to broken state, a fresh status report is generated at once and broadcast.
This  is  done  so  that  a  site  that  crashes  is  quickly  detected  to  have  done  so. 

The link state broadcast is transmitted through the network as a high-priority
message using the technique of flooding. Consider a site i which
initiates a link report broadcast at a local time t. Suppose there exists a
path of n unilinks from site i to site j. Assume that at time t - A, all the sites
on that path are in pause or operational state and the unilinks are in ok_sync
state. Consider the subpath of this path, r hops long (1 ≤ r ≤ n), excluding
site i but including the other end site. If for all r, in the subpath of length r,
all the unilinks continue in ok_sync state, and all the sites continue in one of
the two site states mentioned above till t + rA, then the report will reach site
j by t + nA.

This is achieved by each intermediate site relaying the received report
on its ok_sync unilinks to other sites before its clock exceeds a value A
beyond the LTR value that it holds, immediately after receipt of the report,
for the site from which the report came. This procedure, along with adherence
to the CSR relation, guarantees that the arrival of link status reports
satisfies the time bounds described above.
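As a small worked illustration of the bound t + nA, the sketch below counts the hops on a shortest path of unilinks and evaluates the bound for it. The bound holds for any qualifying path; the shortest one simply gives the earliest guarantee. The function names and the list-of-pairs representation of unilinks are assumptions of the sketch.

```python
from collections import deque

def hop_count(unilinks, src, dst):
    """Number of unilinks on a shortest directed path, or None if no path."""
    out = {}
    for a, b in unilinks:
        out.setdefault(a, []).append(b)
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            return dist[u]
        for v in out.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return None

def arrival_bound(t, n, delta):
    """Latest local time by which a report started at t crosses n unilinks."""
    return t + n * delta
```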

The state of a unilink is broadcast as UP if it is in ok_sync state, and as
DOWN  otherwise.  How  these  link  status  reports  are  used  to  update  the  net¬ 
work  status  views  and  to  decide  if  a  site  should  crash  itself  is  described  in 
the  next  section. 

2.2.4.8. The CRASH_OTHERS and CRASH_SELF Modules

Basically, a site i marks another site j or a group of sites containing site
j DOWN when site i has marked all the unilinks to the site or group of sites
DOWN. Unless precautions are taken, the site j thus marked DOWN may
actually be in operational state, thus violating rule C3. For example, it may
have been partitioned by physical failure of the bilinks corresponding to the
unilinks marked DOWN but may not itself have crashed. Further, even the
partition may not have actually occurred but the news of some of the unilinks
having gone back into ok_sync state from broken state may not have
reached site i in time.

Therefore a site j must have a mechanism by which it can anticipate the
possibility of its being marked DOWN, cease operation in time and execute
a recovery procedure.

To this end, a site j maintains two graphs CRASH_OTHERS(j) and
CRASH_SELF(j). The first is used to detect when site j should mark other
sites DOWN and the second when it should crash itself. In each graph there
is a node for every site in the network. If a bilink connects sites i and j in
the network then there are two directed arcs (i,j) and (j,i) in each graph for
the corresponding unilinks. In each graph, for each node and each arc there
is a STATE field and a TIME field. When a link state report arrives at a site,
it updates its graphs in the manner described below. Note that an update
(which consists of a (state, time) pair) is effective only if the time field in the
update is greater than the TIME field for the node or arc in the graph being
updated; otherwise neither the STATE nor the TIME field is changed.

(a) if unilink (p,q) is reported as DOWN by site p in a link state report
timestamped t then the (STATE,TIME) fields for arc (p,q) are set to
(DOWN,t) in both CRASH_OTHERS(i) and CRASH_SELF(i).

(b) if unilink (p,q) is reported as UP by site p in a link state report
timestamped t then the (STATE,TIME) fields for arc (p,q) are set to (UP,t)
only in CRASH_OTHERS(i).
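Rules (a) and (b), together with the timestamp guard stated just before them, can be sketched as follows. Each graph is represented here as a dictionary from an arc (p,q) to a (STATE, TIME) pair; that representation and the function names are assumptions made for illustration.

```python
def apply_update(graph, key, state, time):
    """Apply a (state, time) update only if it is newer than what is stored."""
    _, old_time = graph[key]
    if time > old_time:
        graph[key] = (state, time)
        return True
    return False  # stale update: neither STATE nor TIME changes

def on_link_state_report(crash_others, crash_self, arc, reported, t):
    """Handle one unilink entry of a link state report timestamped t."""
    if reported == "DOWN":
        # Rule (a): DOWN is recorded in both graphs.
        apply_update(crash_others, arc, "DOWN", t)
        apply_update(crash_self, arc, "DOWN", t)
    else:
        # Rule (b): UP is recorded only in CRASH_OTHERS.
        apply_update(crash_others, arc, "UP", t)
```

The asymmetry makes CRASH_SELF the more pessimistic of the two views: bad news enters it at once, while good news reaches it only through the separate step described in Section 2.2.4.10.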

Before we describe under what conditions a site is marked DOWN we
introduce some notation. Let NG = (V,E) be the undirected graph of the
network, i.e. it has a node for every site in the network and there is an arc
connecting nodes p and q in NG if there is a bilink between the corresponding
sites in the network. A component C of NG is a subset (Vc,Ec) of NG such
that (a) every node in Vc is reachable from any other node in Vc through a
path in NG containing only nodes in Vc and (b) Ec consists of all arcs in E
which connect nodes in Vc.

A node in Vc which satisfies the condition that at least one arc exists in
NG connecting it to a node not in Vc is called a boundary node of C. B(C) is
the set of boundary nodes of C. A node not in Vc that satisfies the condition
that at least one arc exists in NG connecting it to a node in Vc is called a
neighbor node of C. N(C) is the set of neighbor nodes of C.

The testing_arcs OTc of component C are the set of directed arcs running
from N(C) to B(C) in CRASH_SELF or CRASH_OTHERS. The
self_testing_arcs STc of component C are the set of directed arcs running
from B(C) to N(C) in CRASH_SELF or CRASH_OTHERS (Fig. 2.6).

The significance of the testing arcs and the self testing arcs of a component
is as follows. Assume |Vc| < n/2, where n is the number of nodes in
NG. When a site j outside C has all the arcs in OTc marked DOWN in
CRASH_OTHERS(j), it marks all the sites in C as DOWN. Let t_max be the
largest of the TIME fields for these arcs in CRASH_OTHERS(j). Site j marks
every site in C DOWN with a TIME field greater than or equal to
t_wait = t_max + |Vc|A. In order to comply with rule C3, every site in C, if it
has not really crashed, must crash itself by t_wait. It will be shown that every
site j in C will find all the self testing arcs of C (or a subcomponent of it) to
be marked DOWN in CRASH_SELF(j) by t_wait. This condition is the signal for
site j to crash itself. These preliminary remarks should help in understanding
the CRASH_OTHERS and CRASH_SELF algorithms given below.

The module for marking sites DOWN in CRASH_OTHERS(j) is invoked
whenever a link state report arrives at site j declaring a unilink (p,q) to be
DOWN where node q is not already marked DOWN in CRASH_OTHERS(j).
This module executes as described below:

(a) find a component C, if one exists, including node q but not node j such
that:

(i) |Vc| < n/2, n being the number of nodes in NG.

(ii) for all l in OTc, STATE(l) = DOWN in CRASH_OTHERS(j).

(b) if such a component C is found, let t_max = max{TIME(l) : l in OTc}. Then

(i) bump C(j), if necessary, to a value greater than t_wait = t_max + |Vc|A,
not responding to any LINKUP or SITEUP messages in the interim.
(The latter are messages related to recovery procedures to be
explained in Section 2.2.4.10).

(ii) mark every node r in Vc not already marked DOWN by setting the
(STATE,TIME) fields to (DOWN,C(j)) in CRASH_OTHERS and
CRASH_SELF.
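One way to carry out step (a) is to grow a candidate component outward from q: whenever an arc from an outside node v into the candidate is not marked DOWN, v cannot be a neighbor node of a qualifying component and so must be pulled inside. The result is the smallest node set that contains q and has all its testing arcs DOWN, so if it contains j or is not a minority, no qualifying component exists. This search strategy is an assumption of the sketch (the text does not say how the component is found), as are the function names, the integer clock, and the simplification of bumping the clock in one step without regard to CSR. Only one node table is updated here, whereas the text marks the nodes in both graphs.

```python
def find_component(adj, arcs, q, j):
    """Smallest node set containing q, excluding j, with all testing arcs
    DOWN and fewer than n/2 nodes; None if no such component exists."""
    if q == j:
        return None
    vc, stack = {q}, [q]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v in vc or arcs[(v, u)][0] == "DOWN":
                continue
            if v == j:          # j itself would have to be inside C
                return None
            vc.add(v)
            stack.append(v)
    return vc if 2 * len(vc) < len(adj) else None

def crash_others(adj, arcs, nodes, q, j, clock, delta):
    """Steps (a) and (b); returns the possibly bumped clock C(j).

    arcs maps (from, to) -> (STATE, TIME); nodes maps site -> (STATE, TIME).
    """
    vc = find_component(adj, arcs, q, j)
    if vc is None:
        return clock
    ot = [(v, u) for u in vc for v in adj[u] if v not in vc]
    if not ot:
        return clock
    t_max = max(arcs[a][1] for a in ot)
    t_wait = t_max + len(vc) * delta
    clock = max(clock, t_wait + 1)   # a value greater than t_wait
    for r in vc:
        if nodes[r][0] != "DOWN":
            nodes[r] = ("DOWN", clock)
    return clock
```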


The module to detect if site j should crash itself is invoked whenever a
unilink (p,q) which was UP in CRASH_SELF(j) is set to DOWN as a result of
receiving a link state report from p. This module executes as follows:

Find a component C, if one exists, including nodes p and j such that

(i) |Vc| < n/2, n being the number of nodes in NG.

(ii) for all l in STc, STATE(l) = DOWN in CRASH_SELF(j).

If such a component exists, enter the crashed state.

This procedure must be completed before C(j) exceeds a value A
beyond the LTR value, immediately after receipt of the link status report,
for the neighbor from whom the report was received.

Table  2.1.  summarizes  the  variables  stored  at  site  i,  for  ease  of  refer¬ 
ence. 

2.2.4.9. Correctness Arguments (I)

Before  stating  the  recovery  procedures  for  links  and  sites,  it  is  of 
interest  to  show  that  the  above  algorithms  work  if  the  unilinks  which  enter 
broken  state  and  sites  which  enter  crashed  state  never  leave  those  states. 
This  will  also  help  in  understanding  the  correctness  of  the  algorithms  after 
recovery  procedures  have  been  incorporated. 

Thm 1: If site p has site q (≠ p) marked as DOWN in CRASH_OTHERS(p) at
local time t, site q is in crashed state at time t (i.e. enters crashed state
before time t).

Proof: Let t_wait be the time when site p marked site q DOWN along with the
other sites in the component C. For each l in OTc, let T_l be TIME(l) in
CRASH_OTHERS(p) at the time the component C was found to be
suitable for marking DOWN and let n_l and b_l be the neighboring and
boundary nodes of C to which l is attached.

INDUCTION HYPOTHESIS H1: For each l in OTc, every site m in C, if still
operational at time T_l + kA, k = 1,2,...,|Vc|, has at least one arc in every path
of length k through nodes in C to n_l marked DOWN in CRASH_SELF(m) by
that time.

BASIS: H1 is true for k = 1 since CSR and the link monitor mechanism ensure
that the site b_l has l' marked DOWN in CRASH_OTHERS(b_l) (where l' is the
arc running from b_l to n_l) by (T_l + A) if it is still operational at that time.

Variable Name            Description
-----------------------  ----------------------------------------
C(i)                     the logical clock value

LTR(k,i)                 the largest timestamp attached to a
                         message received by site i from site k
                         over the bilink between i and k. k
                         ranges over the neighbors of site i.

st(i,k)                  the detailed state of unilink (i,k), which
                         may take the values broken, ok_unsync
                         or ok_sync. k ranges over the
                         neighbors of site i.

state(i)                 the state of site i, which may take the
                         values crashed, sync, pause or
                         operational.

CRASH_OTHERS(i) and      directed graphs with a node for each
CRASH_SELF(i)            site and 2 arcs for each bilink in the
                         network. Each node and arc has an
                         associated STATE and an associated
                         TIME field. The STATE field takes the
                         values UP or DOWN.

issued(i)                a variable in stable storage set to the
                         value of C(i) every X ticks.

Table 2.1.: Variables stored at site i.


Assume H1 true for k = y (y < |Vc|).

Consider a node g in C that has a path P of length y+1 containing only nodes
in C to n_l. The next node h on this path has a path P' of length y to n_l which
is P minus the arc from g to h.

By our inductive assumption on H1, site h has an arc on P' marked DOWN by
T_l + yA if it is still operational at that time.

Hence, by T_l + (y+1)A, site g has marked DOWN either the arc from node g to
node h or else the arc in P' which was marked DOWN by site h by time
T_l + yA, since this information would be relayed to site g by T_l + (y+1)A.

Hence H1 is true for k = y+1 and hence for k = 1,2,...,|Vc|.

But a node in C can only have paths of length at most |Vc| through nodes in
C to n_l for all l in OTc.

Hence there exists a component C' containing q such that Vc' is a subset of
Vc and for all l' in STc', STATE(l') = DOWN in CRASH_SELF(q) before
max{T_l : l in OTc} + |Vc|A ≤ t_wait. Hence site q will crash itself before t_wait.


The next theorem gives a sufficient condition under which a site will not
have to crash itself. We define a component C as working in the time interval
(t1,t2) if:

(i) |Vc| > n/2.

(ii) Sites corresponding to nodes in C are operational at t1 and suffer no
hardware or software failures resulting in their crashing in the given time
interval.

(iii) There exists a subset E' of Ec such that

(a) the graph (Vc,E') is connected.

(b) for each arc in E', both unilinks corresponding to the arc are recorded
as in ok_sync state at t1, and neither suffers any physical failure over the
given time interval.


Thm 2: If a component C works over (t1,t2), no site in C has to crash itself in
this time interval.

Proof: If possible, let one or more sites in C crash themselves in this interval.
Let p be the site in C which is the earliest to crash itself. Let C' be the
component which fulfilled the requirements for p to crash itself. Since
|Vc'| < n/2, there must exist a node q ≠ p in C which is also in N(C') and a
path consisting of nodes and arcs in (Vc,E') to node q. Let r be the node in
this path in B(C'). Since p is the first site in C to crash itself, site q is still
operational at the time p crashes itself. Hence the unilink (r,q) cannot have
been marked DOWN as a result of q going into crashed state and site r
reporting the unilink to it as DOWN consequently. Also the bilink corresponding
to this unilink suffers no physical failure in the given interval. Thus there
is no sequence of events that could cause this unilink to be marked DOWN in
CRASH_SELF(p). Thus C' does not fulfill the conditions for site p to crash
itself, contradicting our assumption.

2.2.4.10. Recovery Procedures


2.2.4.10.1.  Overview 


In this section, we describe the recovery procedures for sites and links.
We describe the procedures executed by a site as it recovers from crashed
state, through sync and pause states, to operational state. The recovery
procedures for links are described in the same context, since link recovery
is part of site recovery.

In  order  to  motivate  the  rest  of  this  section,  we  first  briefly  summarize 
the  recovery  procedure  as  executed  by  site  i. 

In the crashed state, the site i sets its clock to a value greater than any
it has had before. For this purpose, it makes use of a variable called
issued(i) kept in stable storage, which is always maintained at most X behind
the clock value.

In the sync state, the site synchronizes one or more of its bilinks, i.e.,
its clock is brought within A of the clocks of the corresponding neighbors.
Next it appoints one or more neighbors as guards to ensure that broadcasts
occurring in the network from now on till it enters operational state reach
it, even though it may be recorded as DOWN by the broadcasting sites in this
interval.

In the pause state, the site broadcasts news of the recovery of one or
more unilinks (i,k) through LINKUP messages. When all the sites that are
maintaining CRASH_OTHERS graphs at the time acknowledge that they have
marked the unilink (k,i) UP in their CRASH_OTHERS graphs, site i, through
a LINKSAFE message, broadcasts the information that they may now mark
unilink (i,k) UP in their CRASH_SELF graphs.

Here a complication crops up. A unilink l, by failing at a critical time,
may delay the receipt of unilink status reports which would have caused a
site j to crash itself. In the case of no recoveries considered earlier, this
posed no problem. As we found in the proof of Theorem 1, the failed unilink
l, by being itself marked DOWN at site j, effectively 'substituted' for
unilinks, the news of whose failure is delayed in reaching j as a result of
l's failure. Hence, site j still crashed itself in time. In the environment
we are considering now, in which recoveries do occur, l's failure may cause
a delay in reports reaching j, but l may then recover and be marked UP in
CRASH_SELF(j), thus ending the substitution. Hence site j may not crash
itself when it should have.

For this reason, when site i collects acknowledgements for the LINKUP
message for a unilink (i,k), it ascertains which unilinks are marked DOWN in
CRASH_SELF graphs in the network (and news of whose failures may have
been delayed in reaching sites as a result of the failure of (i,k)). In the
subsequent LINKSAFE broadcast, the identities of these unilinks are included,
and every site on receiving the LINKSAFE marks them DOWN in its
CRASH_SELF graph. Thus, when a substituting link is marked UP in the
CRASH_SELF graph of a site, that site also receives the information
regarding failures of unilinks which was delayed in reaching it as a result
of the failure of the substituting unilink.

We return to the sequence of site i's recovery actions. The site i
broadcasts news of its impending transition into operational state with a
SITEUP message. After collecting acknowledgements from all sites maintaining
CRASH_OTHERS graphs that they have marked site i UP, it discards its
guards and enters operational state. At this point, execution of higher-level
software, which makes use of the status maintenance scheme, may be
started.

We  now  give  a  detailed  description  of  the  above  steps. 

2.2.4.10.2. crashed -> sync

In the crashed state, the recovery procedure consists of setting the local
clock to a value greater than any it ever had before (ensuring the
monotonicity of the clock through crashes) and waiting till at least one
neighboring site in operational state is discovered.

Each site i has a variable issued(i) in stable storage [LAM 76]. (Writes
to stable storage are atomic and the contents persist through a crash.) On
recovery from crashed state, the site i adds X to issued(i), sets C(i) to this
value and sets issued(i) to this new value, too. From then on, issued(i) is
updated every X ticks of C(i) to the new value of C(i).
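
The clock-reset rule above can be illustrated with a minimal sketch. The class name SiteClock, the dictionary standing in for stable storage, and the unit tick are assumptions made for illustration only; they are not part of the scheme as specified.

```python
class SiteClock:
    """Logical clock C(i) backed by a stable-storage variable issued(i).

    Hypothetical sketch: `stable` is a dict standing in for stable storage,
    and `c` is the volatile clock value that a crash would destroy.
    """

    def __init__(self, x, stable=None):
        self.x = x  # checkpoint interval X, in clock ticks
        self.stable = stable if stable is not None else {"issued": 0}
        self.c = self.stable["issued"]

    def tick(self):
        self.c += 1
        # Checkpoint every X ticks, so issued(i) is never more than X behind.
        if self.c - self.stable["issued"] >= self.x:
            self.stable["issued"] = self.c

    def recover(self):
        # On recovery: add X to issued(i), then set C(i) and issued(i) to it.
        new_value = self.stable["issued"] + self.x
        self.stable["issued"] = new_value
        self.c = new_value
```

Because issued(i) trails C(i) by less than X at the moment of a crash in this sketch, issued(i) + X exceeds every value the clock held before the crash, which is the monotonicity property required.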

Next, it sets its attached unilinks to broken and activates the link
monitor. It then waits till at least one of the unilinks directed from it is
set to ok_unsync state by the link monitor. It periodically sends out a
STATUS_REQ message on all the ok_unsync links. (This message is not
timestamped. Timestamped messages are sent out on ok_sync unilinks
only.) A site j in N(i) responds to this message only if it is in operational
state. When at least one site in N(i) has responded that it is in operational
state, the site i enters sync state.

2.2.4.10.3. sync -> pause

In the sync state, the recovery procedure consists of synchronizing the
local clock with neighboring sites, initializing the CRASH_OTHERS and
CRASH_SELF graphs and appointing guards for its upcoming stay in the
pause state.

The logical clock C(i) is now coupled to the local real-time clock. If the
first technique described in Section 2.2.4.5 is used, the local real-time
clock, which may have stopped functioning in the crash, must first be
resynchronized with other functioning real-time clocks in the network.

The site i initializes its CRASH_OTHERS and CRASH_SELF graphs to
show all the unilinks directed from it as in DOWN state and all other sites
and all other unilinks in the network as UP with the current value of C(i),
i.e., the TIME fields are set to the current value of C(i).

Next the SYNC_LINK module is invoked sequentially on each of the
ok_unsync unilinks to sites that have signified that they are operational.
When a SYNC_LINK module invocation returns with the corresponding
unilink (i,j) in ok_sync state, the site i asks site j to be its guard during
its upcoming stay in the pause state. It does this by sending an
ENTERING_PAUSE message to site j, which responds with an
ENTERING_PAUSE_ACK if still in operational state. When site i has
appointed one or more guards, it enters pause state. If subsequently the
unilinks to all its guards go into broken state before site i has completed
its stay in pause state and entered operational state, the site enters
crashed state again.

When a site j receives an ENTERING_PAUSE message, it replies
with an ENTERING_PAUSE_ACK if in operational state. From then on, till it

(a) enters crashed state itself, or

(b) receives a LEAVING_PAUSE message (to be described in the next
section), or

(c) the unilink (j,i) leaves ok_sync state,

in responding to any SITEUP, LINKUP or LINKSAFE messages (to be
discussed below), site j specifies site i to be in pause state in its
response. In the last case, site j bumps up its clock by A before replying
to any of the aforesaid messages, by which time site i will be aware of
site j ceasing to be its guard.

The larger the number of guards site i appoints before entering pause
state, the less likely it is that its recovery while in pause state will have
to be restarted from crashed state as a result of its unilinks to all its
guards leaving ok_sync state.

2.2.4.10.4. The SYNC_LINK Module

The  function  of  this  module  is  to  synchronize  the  clock  of  the  invoking 
site  with  that  of  the  site  at  the  other  end  of  the  unilink  for  which  it  is 
invoked,  thus  setting  the  unilink  (  and  its  reverse  counterpart  )  to  ok_sync 
state. 

A MY_TIME_IS message carrying the current value of C(i) is sent on the
unilink (i,j).

If the response (we describe the appropriate responses to messages
sent out by this module below) carries a clock value within A of the current
logical clock value, a SYNCHED message is sent on the unilink carrying the
current value of C(i). If a timestamped SYNCH_ACK message is received
and the timestamp is within A of the current value of C(i), the state of the
link is set to ok_sync (the timestamp being used to set the corresponding
LTR register) and the module returns. If the timestamp on the SYNCH_ACK
is more than A less than the current value of C(i), the SYNC_LINK procedure
is restarted.

If the response to the MY_TIME_IS message carries a clock value
exceeding the current value of C(i) by more than A, the clock
synchronization module is invoked to bump C(i) up to the clock value in the
response. Then the SYNC_LINK procedure is started again.

If the response to the MY_TIME_IS message carries a clock value which
is less than the current value of C(i) by more than A, the module restarts
the SYNC_LINK procedure.

If a SYNCH_NACK message is received in response to the SYNCHED
message, it may bear a clock value exceeding that of the SYNCHED message
by more than A. In this case, the same procedure (described above) that is
undertaken when the response to a MY_TIME_IS message exceeds the current
value of C(i) by more than A is executed. Otherwise, the SYNC_LINK
procedure is simply restarted.
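
The initiator's side of the exchange can be summarized as a small decision table. The following is a sketch only: the function names and the action labels are hypothetical, and treating a SYNCH_ACK timestamp that is ahead of C(i) by more than A as a restart is our assumption, since the text does not cover that case.

```python
def on_my_time_is_ack(c_i, their_clock, a):
    """Initiator's next step after the reply to MY_TIME_IS.

    Returns (action, new value of C(i)). Names are illustrative.
    """
    if their_clock - c_i > a:
        # Peer is ahead by more than A: bump C(i) up, then start over.
        return ("BUMP_AND_RESTART", their_clock)
    if c_i - their_clock > a:
        # Peer is behind by more than A: start over with C(i) unchanged.
        return ("RESTART", c_i)
    return ("SEND_SYNCHED", c_i)


def on_synch_ack(c_i, timestamp, a):
    """Initiator's next step after a timestamped SYNCH_ACK."""
    if abs(c_i - timestamp) <= a:
        # Link becomes ok_sync; the timestamp is used to set the LTR register.
        return ("SET_OK_SYNC", timestamp)
    # Assumption: any timestamp further than A away causes a restart.
    return ("RESTART", c_i)


def on_synch_nack(synched_value, their_clock, a):
    """Initiator's next step after a SYNCH_NACK carrying the peer's clock."""
    if their_clock - synched_value > a:
        return ("BUMP_AND_RESTART", their_clock)
    return ("RESTART", synched_value)
```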

A site j should respond to the messages of the SYNC_LINK module only
when in operational state.

The response to a MY_TIME_IS message from site i when unilink (j,i) is
in ok_unsync state (the unilink is set to broken state if it is not in
ok_unsync state when this message arrives) is to bump C(j) up to the clock
value carried by the message, if necessary. Then a MY_TIME_IS_ACK message
bearing the current value of C(j) is sent.

The response to a SYNCHED message is, if the clock value borne by it is
within A of C(j), to set the unilink (j,i) to ok_sync state using the clock
value to set the corresponding LTR register and to return a timestamped
SYNCH_ACK message. If the clock value of the SYNCHED message is not
within A of C(j), a SYNCH_NACK message carrying the current value of C(j)
is returned.

When a site j in operational state sets a unilink (j,i) to ok_sync state, it
invokes the BROADCAST_LINKUP module (described below) on it.

If during the above procedure, the unilink (i,j) gets set to broken state,
the SYNC_LINK module returns at once.
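
The responder's rules can be sketched in the same style. The function names and return shapes are hypothetical; the caller is assumed to have already checked that site j is in operational state, and sending no reply when the unilink is found outside ok_unsync state is our assumption.

```python
def respond_my_time_is(c_j, link_state, msg_clock):
    """Site j handles MY_TIME_IS on unilink (j,i).

    Returns (new C(j), new state of (j,i), reply or None).
    """
    if link_state != "ok_unsync":
        # The unilink is set to broken; no reply is sent (assumption).
        return c_j, "broken", None
    c_j = max(c_j, msg_clock)  # bump C(j) up if necessary
    return c_j, link_state, ("MY_TIME_IS_ACK", c_j)


def respond_synched(c_j, msg_clock, a):
    """Site j handles SYNCHED.

    Returns (new state of (j,i) or None if unchanged, reply).
    """
    if abs(c_j - msg_clock) <= a:
        # msg_clock would also be loaded into the LTR register here.
        return "ok_sync", ("SYNCH_ACK", c_j)
    return None, ("SYNCH_NACK", c_j)
```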

In  the  pause  and  operational  states,  this  module  is  invoked  whenever  a 
unilink  goes  from  broken  to  ok_unsync  state. 

2.2.4.10.5. pause -> operational

In the pause state, the recovery procedure consists of informing the
network of the recovery of the unilinks directed from the site and of the
intended transition of the site to operational state.

The site i starts issuing and relaying link state reports, and processing
them to update its CRASH_OTHERS and CRASH_SELF graphs as described
in Section 2.2.4.8, with the exception that the CRASH_SELF module is not
invoked for the time being. It responds to any LINKUP, LINKSAFE and
SITEUP messages received as described below.

Next, the site i invokes the BROADCAST_LINKUP module (described
below) in parallel on all ok_sync unilinks (i,j) attached to it. These
invocations, if successful, will produce broadcasts of LINKSAFE messages for
these unilinks. (See the description of the BROADCAST_LINKUP module below
for a description of LINKSAFE messages.)

When all the BROADCAST_LINKUP module invocations have returned, if
not even one of the ok_sync unilinks has been marked UP in CRASH_SELF(i)
as a result of receipt of a LINKSAFE message, the site re-enters crashed
state. If at least one of the ok_sync unilinks has been marked UP in
CRASH_SELF(i), the CRASH_SELF module is invoked. From this instant on,
the CRASH_SELF module is invoked whenever a received link state report
makes it appropriate to do so as described in Section 2.2.4.8. If, when the
module returns, the site has not entered crashed state, the following
procedure is executed.

The site i broadcasts a timestamped SITEUP message. (The responses
to this message as well as to the LINKUP and LINKSAFE messages should
carry the timestamps of the messages to allow them to be matched up.) The
appropriate responses to the SITEUP message for any site j in the network
are:

(i) if in crashed or sync states, respond with a timestamped SITEUP_ACK.

(ii) if in pause state or in operational state, set the (STATE, TIME) fields
in CRASH_OTHERS(j) and also in CRASH_SELF(j) for site i to
(UP, current local time), provided the timestamp on this message is greater
than the TIME fields. If in operational state, the ids of all sites that
site j is currently guarding should be added to the acknowledgement.
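
The two response rules can be sketched as follows. The function name, the dictionary layout of the graphs and the shape of the acknowledgement are hypothetical, and applying the timestamp test to each graph separately is our reading of "greater than the TIME fields".

```python
def respond_siteup(state, graphs, sender, msg_ts, now, guarded=()):
    """Site j's response to a SITEUP message from `sender`.

    graphs: {"CRASH_OTHERS": {site: (STATE, TIME)}, "CRASH_SELF": {...}}
    Returns the acknowledgement, which echoes the message timestamp.
    """
    ack = {"type": "SITEUP_ACK", "ts": msg_ts}
    if state in ("crashed", "sync"):
        return ack  # rule (i): acknowledge only
    # Rule (ii): mark the sender UP where the message is newer than TIME.
    for name in ("CRASH_OTHERS", "CRASH_SELF"):
        _, recorded = graphs[name][sender]
        if msg_ts > recorded:
            graphs[name][sender] = ("UP", now)
    if state == "operational":
        ack["guarded"] = sorted(guarded)  # ids of sites j is guarding
    return ack
```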

Site i periodically resends timestamped SITEUP messages till all sites
that have not responded are marked DOWN in CRASH_OTHERS(i). In
addition, if any site k is specified as being in pause state by one of its
guards, the site and its guards are sent SITEUP messages, till either the
site responds or its guards cease specifying site k as being in pause state
(by getting marked DOWN themselves in CRASH_OTHERS(i) or returning SITEUP
acknowledgements without specifying node k). In the latter case, either the
site k has entered UP state, in which case site i will have received a SITEUP
message from it, or all the guards have stopped being guards before site k
enters operational state, in which case site k reenters crashed state.

The site i next sends a LEAVING_PAUSE message to those guards to
which its unilinks have not left ok_sync state since the time they responded
to site i's ENTERING_PAUSE message. It waits for a LEAVING_PAUSE_ACK
or for the corresponding unilink to enter broken state. When either of these
events has occurred for each unilink over which a LEAVING_PAUSE message
was sent, it enters UP state if at least one LEAVING_PAUSE_ACK is received;
otherwise it reenters crashed state. Only after entering UP state is
execution of higher-level software resumed.
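
The final decision of the pause state can be written down compactly. This is a sketch under assumptions: the function name and the event labels "ack" and "broken" are hypothetical, and a guard for which neither event has yet occurred is represented by any other value.

```python
def leave_pause_outcome(events):
    """Decide site i's next state once LEAVING_PAUSE has been sent.

    events maps each guard that was sent LEAVING_PAUSE to "ack" (a
    LEAVING_PAUSE_ACK arrived) or "broken" (the unilink entered broken
    state). Returns None while some guard is still pending.
    """
    outcomes = list(events.values())
    if any(e not in ("ack", "broken") for e in outcomes):
        return None  # keep waiting
    # At least one acknowledgement is required to become operational.
    return "operational" if "ack" in outcomes else "crashed"
```

For example, one acknowledging guard is enough even if the links to all other guards have broken, which is why appointing several guards makes a restart from crashed state less likely.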

The original SITEUP message is sent by flooding (but without the time
constraints imposed on the flooding mechanism by the link state reports);
all responses and subsequent SITEUP retransmissions can be sent by
normally routed messages. Similar considerations hold for the LINKUP and
LINKSAFE messages discussed in the next section.

2.2.4.10.6. The BROADCAST_LINKUP Module

The function of this module, when invoked on the unilink (i,j), is to
ensure that the CRASH_OTHERS graphs in the network have been updated to
show the unilink (j,i) UP and then to have the unilink (i,j) marked UP in
CRASH_SELF graphs.

When a link (i,j) is restored from broken to ok_sync state, this
information is broadcast by the link state reporting mechanism described in
Section 2.2.4.7, and other sites appropriately update their CRASH_OTHERS
graphs. However, the updating of the link state in the CRASH_SELF graphs
must be postponed till it is certain that all other sites have updated their
CRASH_OTHERS graphs, in order to leave no chance of rule C3 being
violated.

Hence the BROADCAST_LINKUP module, which is responsible for getting
unilinks marked UP in the CRASH_SELF graphs, does this job in two phases.

Phase I: The site i, which is executing the BROADCAST_LINKUP module for
unilink (i,j), broadcasts a timestamped LINKUP message carrying the LTR
value, say t_l, for node j. The responses to this message from any node k
are:

(i) if in crashed or sync state, return a timestamped LINKUP_ACK specifying
site k's current state.

(ii) if in pause state, the (STATE, TIME) fields for unilink (j,i) are set to
(UP, t_l) in CRASH_OTHERS(k). A timestamped LINKUP_ACK is returned.
The identities of unilinks directed from site k which are marked DOWN in
CRASH_SELF(k) are specified in the LINKUP_ACK along with their TIME
fields in the same graph.

(iii) if site k is in operational state, the unilink (j,i) is set to
(UP, t_l) in CRASH_OTHERS(k). A timestamped LINKUP_ACK is returned. The
identities of those unilinks directed from node k which are marked DOWN in
CRASH_SELF(k) are sent along with their TIME fields from the same graph
in this acknowledgement. The ids of all guarded sites should also be
specified.

Timestamped LINKUP messages carrying a time t_l are resent till all
sites not marked DOWN respond. Responses from sites specified by their
guards are also sought. For every site, whether specified as guarded or not,
from whom an explicit response is not received, the following procedure is
applied.

Let r be a site marked DOWN in CRASH_OTHERS(i) from which a
response is not collected. Let t_r be the maximum of the TIME fields for
those neighbors of site r marked DOWN in CRASH_OTHERS(i). Then a
LINKUP message timestamped t_r must be sent to all the UP neighbors of
site r, if any, and their responses collected. (If any of these UP sites get
marked DOWN before responding, the procedure should be restarted with
these neighbors newly marked DOWN included in the set of neighbors for
whom the maximum TIME field is computed.) If none of the UP neighbors
specify r as a guarded site in their response, site r, if still marked DOWN,
either has not appointed any guards at time t_r or has reentered crashed
state as a result of all unilinks to its guards leaving ok_sync state. Hence
site r is taken to have implicitly responded that it is in crashed or sync
state with a message timestamped t_r.

The responses are processed as follows:

(i) A response, implicit or explicit, from a site in crashed or sync state is
treated as specifying that all the unilinks directed from that node should be
marked DOWN in CRASH_SELF(i) at the time corresponding to the response
timestamp.

(ii) For each unilink specified as DOWN at a time t in responses from sites in
other states, the (STATE, TIME) fields for that unilink are set to (DOWN, t)
in CRASH_SELF(i).

The site i then enters Phase II.

Phase II: The site i broadcasts a LINKSAFE message for unilink (i,j),
carrying the time t_l. This message carries, in addition, a list of all the
unilinks marked DOWN in CRASH_SELF(i) along with their TIME fields from this
graph. The responses to this message from a site k are:

(i) if site k is in crashed or sync state, return a timestamped
LINKSAFE_ACK.

(ii) if k is in pause or operational state, set the (STATE, TIME) fields for
unilink (i,j) to (UP, t_l) in CRASH_SELF(k). For each unilink specified as
DOWN at a time t in the LINKSAFE message, set the (STATE, TIME) fields for
that unilink to DOWN at the time t in CRASH_SELF(k). A timestamped
LINKSAFE_ACK is returned. If in operational state, the ids of all guarded
sites should be specified in the acknowledgement.

The LINKSAFE message should be resent till all sites not marked DOWN
respond. Responses from guarded sites are collected as for SITEUP
messages. Then the module returns.

The module returns immediately if the unilink (i,j) goes to broken state
in Phase I or Phase II.
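
The two phases can be traced on a toy network with the following sketch. It is deliberately simplified and rests on several assumptions of ours: every site answers on the first attempt (no retransmissions, guards or implicit responses), each site is a dictionary with hypothetical keys "state", "clock", "out", "CRASH_OTHERS" and "CRASH_SELF", a crashed site's "clock" stands in for its response timestamp, and the unilink (i,j) itself is left out of the DOWN list carried by the LINKSAFE message.

```python
def broadcast_linkup(i, j, t_l, sites):
    """Simplified two-phase BROADCAST_LINKUP for unilink (i, j).

    Mutates the graphs in `sites` and returns the DOWN list that the
    LINKSAFE message carries.
    """
    me = sites[i]

    # Phase I: LINKUP. Others mark (j, i) UP in CRASH_OTHERS and report
    # their own outgoing unilinks that are DOWN in CRASH_SELF.
    for name, site in sites.items():
        if name == i:
            continue
        if site["state"] in ("crashed", "sync"):
            # Treated as: all unilinks directed from this site are DOWN.
            for link in site["out"]:
                me["CRASH_SELF"][link] = ("DOWN", site["clock"])
        else:
            site["CRASH_OTHERS"][(j, i)] = ("UP", t_l)
            for link, (status, t) in site["CRASH_SELF"].items():
                if status == "DOWN" and link[0] == name:
                    me["CRASH_SELF"][link] = ("DOWN", t)

    # Phase II: LINKSAFE. Sites in pause or operational state mark (i, j)
    # UP in CRASH_SELF and adopt the DOWN markings collected in Phase I.
    down = {link: entry for link, entry in me["CRASH_SELF"].items()
            if entry[0] == "DOWN" and link != (i, j)}
    for site in sites.values():
        if site["state"] in ("pause", "operational"):
            site["CRASH_SELF"][(i, j)] = ("UP", t_l)
            site["CRASH_SELF"].update(down)
    return down
```

The point of the sketch is the ordering: no CRASH_SELF graph shows (i,j) UP until every reachable CRASH_OTHERS graph shows (j,i) UP, and the same message that marks (i,j) UP also delivers the DOWN markings whose news the failure of (i,j) may have delayed.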

2.2.4.11. Correctness Arguments (II)

In  this  section,  we  develop  analogs  to  Theorems  1  and  2  for  our  scheme 
with  recovery  procedures  incorporated. 

In  proving  the  analog  of  Theorem  1,  we  have  to  show  that  recoveries  of 
unilinks  and  sites  do  not  prevent  a  site  from  crashing  itself  in  time  when 
needed  if  it  has  been  marked  DOWN  at  some  other  site. 

Thm 3: If site p has site q ≠ p marked DOWN in CRASH_OTHERS(p) at local
time t, site q is not in operational state at time t.

Proof: Arrange the various events corresponding to marking DOWN of sites in
increasing order of the local times at which they occur (if some occur at the
same time, arrange according to increasing id of the site at which the event
occurs).

Let t_j be the logical time of the j-th such event.

INDUCTION HYPOTHESIS H1: If site p has site q ≠ p marked DOWN in
CRASH_OTHERS(p) at local time t < t_j, site q is not in operational state at
time t.

BASIS: Obviously true for j=1, since no site marks any other node DOWN
before t_1.

Assume H1 true for j=x.

Consider the marking DOWN event at t_x, which, say, is the marking DOWN at
site m of sites in component C. For each l in OT_C, let T_l be TIME(l) in
CRASH_OTHERS(m) at the time the component C was found to be suitable
for marking DOWN, and n_l be the node in N(C) to which l is attached. Let
T_max = max{T_l} for l in OT_C, so that t_x ≥ T_max + |V_C|A.

Consider a node a = r_k in C which has a path k ≤ |V_C| hops long to
n_l = r_0 (the path being r_k, ..., r_1, r_0). We will show that at t_x,
site a has at least one unilink on the path from a to n_l marked as DOWN in
CRASH_SELF(a).

Define a series of local times for nodes r_0, r_1, ..., r_k as follows:

(i) θ_0 is defined as T_l.

(ii) If the report by site r_0 at T_l of unilink (r_0, r_1) being DOWN was
due to physical failure of the bilink between r_0 and r_1, and if node r_1 is
in pause or operational state before θ_0 + A for a sufficiently long time,
its link monitor will detect this failure and initiate a link state report.
Let θ_1 be the latest time before θ_0 + A when the monitor initiates a
broadcast and marks the unilink (r_1, r_0) DOWN in CRASH_SELF(r_1), if this
situation exists. On the other hand, r_1 may not be in either of the above
states during the period of the failure. It is also possible that the report
at T_l by r_0 may be caused by a crash of r_1, which makes the unilink from
r_0 to r_1 enter broken state. In these cases, where the link monitor in r_1
does not initiate any broadcast of the failure which caused the report at
T_l, we define θ_1 as the latest time before θ_0 + A when r_1 enters crashed
state.

(iii) For i > 1, θ_i is defined as follows:

(a) If θ_{i-1} is the time of marking DOWN in CRASH_OTHERS(r_{i-1}) and
CRASH_SELF(r_{i-1}) of some unilink on the path from r_{i-1} to r_0, as a
result of a link state report initiated by one of the sites r_1, r_2, ...,
r_{i-1}, and if this report reaches r_i by θ_{i-1} + A, θ_i is defined as the
time at which this unilink is marked DOWN in CRASH_SELF(r_i) as a result of
receiving this report.

(b) However, this report may not reach r_i by θ_{i-1} + A because of the
failure of the bilink between r_{i-1} and r_i. Alternatively, r_{i-1} may
have entered crashed state at θ_{i-1}. In these cases, we define θ_i as the
latest time before θ_{i-1} + A when the link monitor in r_i detected the
broken or ok_unsync state of the unilink from r_i to r_{i-1} and initiated a
link state report causing this unilink to be marked DOWN in CRASH_SELF(r_i).

(c) Lastly, if site r_i was not in operational or pause state at a time
before θ_{i-1} + A to make any of the above situations occur, we define θ_i
as the latest time before θ_{i-1} + A when site r_i enters crashed state.


INDUCTION HYPOTHESIS H2:

(i) θ_i ≤ T_l + iA.

(ii) Site r_i has some unilink on the path to r_0 in front of it marked
DOWN in CRASH_SELF(r_i) from θ_i to t_x at every instant it is maintaining
this graph.

BASIS: The first part of H2 is true for i=1 because θ_1 ≤ T_l + A by
definition. The second part is true for the following reason. Since r_0
has reported the unilink (r_0, r_1) as DOWN at T_l, the unilink (r_1, r_0)
cannot become ok_sync before T_l. Hence the LINKUP broadcast cannot be
started before T_l. But site m, which marks the nodes in C DOWN at t_x,
cannot respond to this broadcast, whether in pause or operational
state, before t_x. Further, since site m has entered pause state before
T_l, site r_1 cannot avoid waiting till this response is received from site
m, whichever state, pause or operational, it is in. (Remember that H1 is
assumed to be true for j=x, hence site r_1 cannot incorrectly consider
sites in operational state, including those that might be guarding m, to
be DOWN.) Therefore, the corresponding LINKSAFE message can be
broadcast only after t_x by r_1. Hence the unilink (r_1, r_0) remains marked
DOWN in CRASH_SELF(r_1) from θ_1 to t_x at any instant that r_1 is
maintaining this graph in this period.

Assume H2 true for i < y.

Consider r_y, which either crashed at θ_y or marked a unilink in front of it
on the path to r_0 DOWN at θ_y. In the first case, after recovery, the
unilink from r_y to r_{y-1} cannot become ok_sync before θ_{y-1}. Hence the
LINKUP broadcast for this unilink cannot be started before θ_{y-1}. In
the second case, the marking DOWN is the result of a link status report
issued at θ_d by r_d for some d ≤ y, reporting the unilink (r_d, r_{d-1}) to
be DOWN. In this case, the LINKUP broadcast for the unilink (r_d, r_{d-1})
cannot begin before θ_d. In either case, it follows from our inductive
assumption on H2 and the way the responses to the LINKUP broadcast
are processed, that if these responses are received before t_x, the ensuing
LINKSAFE will indicate some unilink (r_s, r_{s-1}) on the path to r_0, in
front of the unilink whose safety is being broadcast, as to be marked
(DOWN, t), where t ≤ t_x, in the CRASH_SELF graph of every site
receiving the LINKSAFE message.

Hence the marking UP in CRASH_SELF(r_y) of the unilink in front of r_y
marked DOWN by it at θ_y, or the marking UP of the link (r_y, r_{y-1}) on
recovery if it crashed at θ_y, will be accompanied by the marking DOWN
of some unilink in front of the unilink being marked UP, if this marking
UP occurs before t_x. If, in turn, this unilink is marked UP before t_x,
some other unilink in front of it on the path to r_0 will be marked DOWN.
[Ultimately, the unilink (r_1, r_0) may be marked DOWN and the LINKSAFE
for this unilink cannot be issued, as already indicated, before t_x.]
Hence site r_y will have some unilink in the path to r_0 marked DOWN in
CRASH_SELF(r_y) from θ_y to t_x. Moreover, since θ_y ≤ θ_{y-1} + A by
definition and since θ_{y-1} ≤ T_l + (y-1)A by our inductive assumption on
H2, it follows that θ_y ≤ T_l + yA.

Hence H2 is true for i=y and hence for i=1,2,...,k.

Thus at t_x, the site a has at least one unilink on every path from itself to
n_l marked DOWN in CRASH_SELF(a) for all l in OT_C.

Hence, as in Theorem 1, no site in C will be in operational state at t_x.

Thus at t_x, the CRASH_OTHERS graph of every site in operational or pause
state is correct in that it shows no site as DOWN that is actually in
operational state at that time.

After t_x, no sites are marked DOWN in any CRASH_OTHERS graph till t_{x+1},
when the next marking DOWN of a site or sites occurs. Hence to show that
the graphs remain correct in this period, it suffices to show that no site
enters operational state before informing any site that has entered pause or
operational state and marked it DOWN that it is entering operational state,
through a SITEUP message.

To show this, we order the events corresponding to sites entering operational
state in the above interval in increasing order of times that they enter this
state. Consider the first such recovery, say of site w. Site w's
CRASH_OTHERS graph was correct at t_x and is correct at all instants to the
instant it enters operational state, since it is the first site to enter
operational state after t_x. Since a site can fail to have itself marked UP
at sites that have marked it DOWN, whether they have done so when they were
in pause state or in the operational state, only by incorrectly considering
sites in operational state DOWN, it follows that site w does get itself
marked UP at all appropriate sites before it enters operational state. Hence
all the CRASH_OTHERS graphs are correct at the time of the first entry into
operational state after t_x and, using the same arguments, at every
subsequent entry into operational state thereafter till t_{x+1}.

Hence  Hi  is  true  for  j'=x+l,  and  therefore  for  ally,  proving  the  theorem. 
Before stating the analog of Theorem 2 for the case where recovery of
links and sites does occur, we introduce some notation. A unilink (i,j) is
safe at time t if it is marked UP in both graphs, with TIME field values after
which it has not left ok_sync state till time t, at all sites in pause or
operational state, i.e. it is safe from the time a LINKSAFE broadcast for it
has completed till it suffers the first failure after the initiation of the
corresponding BROADCAST_LINKUP. A dynamic component C(t) of NG = {V, E} is a
time-varying graph {Vc(t), E(t)}, such that Vc(t) is a subset of V and E(t) is
a subset of Ec(t), the set of arcs in NG which connect nodes in Vc(t), and
such that {Vc(t), E(t)} is connected for all t. Thus nodes and unilinks
enter C(t), stay for periods of time called membership periods and then
leave.

A dynamic component is safe during the period (t1, t2) if

(i) each site in the component is operational at the beginning of each of its
membership periods in this interval and suffers no crashes due to hardware
or software failures in the membership period,

(ii) each unilink in the component is safe at the beginning of each
membership period and suffers no physical failures during the membership
period, and

(iii) |Vc(t)| > ... for all t in the given interval.

Thm 4: If a dynamic component C(t) is safe during (t1, t2), no site is forced
to crash itself during any of its membership periods in this interval.

Proof: If possible, let one or more sites in C(t) crash themselves during their
membership periods in this interval. Let p be the site in C(t) which is the
earliest to crash itself in this interval during one of its membership
periods, say at time t. Let C* be the component that satisfies the conditions
of the self-crash procedure which p finds in its CRASH_SELF graph.

Since |V_C*| < ..., there must exist a node q ≠ p in C(t) which is also in
N(C*), and a path of nodes and arcs in {Vc(t), E(t)} to node q. Let r be the
node in this path in B(C*). Since p is the first site in C(t) to crash itself
in the given interval during one of its membership periods, sites q and r are
still operational at t. Hence the unilink (r,q), which was safe at the
beginning of its current membership period, cannot have been marked DOWN in
CRASH_SELF(p) either because r crashed and therefore a LINKSAFE message was
able to specify the unilink as DOWN, or because q crashed and site r reported
the unilink DOWN subsequently. Further, the bilink corresponding to this
unilink suffers no physical failure in its current membership period. Hence
there exists no sequence of events that could have caused unilink (r,q) to be
DOWN in CRASH_SELF(p) at time t, contradicting our assumption.


2.2.5.  Overhead  Considerations  and  Choice  of  Parameters 

In  the  Arpanet,  link  state  reports  are  broadcast  from  every  site  every  1 
minute  or  so  for  routing  purposes  and  broadcast  propagation  times  are  less 
than  1  second  (  typically  100ms  )  [MCQ  80]. 

Assuming  that  the  networks  under  consideration  have  similar  size  and 
communication  bandwidth,  we  can  choose  the  period  of  broadcast,  T /,=  1 
minute  so  that  the  communication  overhead  from  the  link  state  reports, 
which  in  any  case  are  needed  along  with  other  information  for  routing  pur¬ 
poses,  is  of  the  same  order.  When  no  link  or  site  failures  occur,  the  only 
additional communication overhead from our scheme arises from messages
for clock synchronization and for link monitoring. For every unilink
recovery, the messages required are (i) the LINKUP and LINKSAFE
broadcasts and (ii) an acknowledgement from each site to the broadcaster for
each of the two broadcasts. As mentioned before, the broadcasting is done
by flooding. In our algorithm, as presented, when a bilink recovers, the
communication costs will correspond to two unilink recoveries. When a site
with L attached bilinks recovers from a crash, the communication costs will
correspond to 2L unilink recoveries plus a SITEUP broadcast and its
acknowledgements. Optimizations in which the messages are piggybacked
should be straightforward but are not explored in this thesis. Even if the
LINKUP broadcasts for the 2L unilink recoveries are not piggybacked, they
can be performed in parallel. The same is true for the LINKSAFE
broadcasts. Hence, the time required from the instant a site recovers
physically to the instant it enters operational state is the time required
for 3 sequential broadcasts (LINKUP, LINKSAFE and SITEUP) and their
acknowledgements. A is chosen so that sending a timestamped message every A interval
to  each  neighbor  is  not  a  burden.  Even  this  burden  is  absent  if  other  times¬ 
tamped  messages  concerned  with  normal  processing  are  being  exchanged. 
Further,  the  choice  should  give  the  site  sufficient  flexibility  in  its  schedule 
for  flooding  link  state  broadcast  messages.  A=  10  seconds  is  a  reasonable 
choice.  Tt  the  real  clock  timer  can  be  set  to  1~2  seconds  without  consuming 
an  appreciable  amount  of  site  resources  in  updating  the  local  clock.  A,  the 
interval  between  stable  storage  writes  can  be  chosen  as  •''SO  seconds  without 
using  up  much  disk  bandwidth. 
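
The recovery costs stated above can be tallied with a small calculation. The following Python sketch is only illustrative: the function names are hypothetical, and it assumes that every broadcast is acknowledged by each of the other n - 1 sites. It counts broadcasts and acknowledgements only, not the individual link transmissions that flooding generates, which depend on the topology.

```python
def unilink_recovery_cost(n_sites: int) -> dict:
    """Broadcasts and acknowledgements for one unilink recovery.

    One LINKUP and one LINKSAFE broadcast, each acknowledged by every
    other site (assumption: n_sites - 1 acknowledgements per broadcast).
    """
    broadcasts = 2
    return {"broadcasts": broadcasts, "acks": broadcasts * (n_sites - 1)}


def site_recovery_cost(n_sites: int, attached_bilinks: int) -> dict:
    """Cost when a site with `attached_bilinks` bilinks recovers from a crash.

    Corresponds to 2L unilink recoveries plus one SITEUP broadcast
    and its acknowledgements.
    """
    per_unilink = unilink_recovery_cost(n_sites)
    unilinks = 2 * attached_bilinks
    broadcasts = unilinks * per_unilink["broadcasts"] + 1  # +1 for SITEUP
    return {
        "broadcasts": broadcasts,
        "acks": broadcasts * (n_sites - 1),
        # LINKUP, LINKSAFE and SITEUP phases run one after the other;
        # broadcasts within a phase can proceed in parallel.
        "sequential_phases": 3,
    }


if __name__ == "__main__":
    print(site_recovery_cost(n_sites=50, attached_bilinks=3))
```

For a 50-site network and a site with 3 attached bilinks, this gives 13 broadcasts spread over 3 sequential phases.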
2.3. An Algorithm for Multiple Copy Updating


2.3.1.  Introduction 


In  this  section,  we  construct  an  algorithm  to  update  a  replicated  file  in  a 
point-to-point  network,  using  the  status  maintenance  scheme  described  in 
the  previous  section.  The  algorithm  can  be  briefly  characterized  as  follows: 

(i)  File  copies  that  are  used  in  executing  commands  from  transactions  can 
be  in  either  of  two  states,  called  the  HOT  and  WARM  states. 

(ii)  When  an  update  command  is  executed,  the  HOT  copies  are  (atomically) 
updated  immediately.  Periodically,  the  HOT  sites  bring  the  WARM  sites  up- 
to-date  by  sending  them  a  list  of  updates,  accumulated  since  the  last  time 
the  WARM  sites  were  brought  up-to-date.  Thus  the  HOT  copies  represent  the 
latest  version  of  the  file  at  all  times. 

(iii) A read command is directed to a HOT site if the current version of the
file is required. If it is not essential to obtain the current version, the
command may be directed to a WARM site.

(iv) If the set of HOT sites is depleted due to site crashes, one or more WARM
sites join the set as necessary. In order to detect such a depletion when it
occurs, the status maintenance scheme described in the previous section is
used.
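
The division of labour between HOT and WARM copies characterized above can be pictured with a minimal single-process Python sketch. It is not the algorithm of this section: the class and method names are hypothetical, atomicity of the HOT update and all failure handling are left out, and the "file" is modelled as a dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class Copy:
    """One copy of the file, with a version number bumped on each update."""
    version: int = 0
    data: dict = field(default_factory=dict)

    def apply(self, update):
        key, value = update
        self.data[key] = value
        self.version += 1


class ReplicatedFile:
    def __init__(self, n_hot: int, n_warm: int):
        self.hot = [Copy() for _ in range(n_hot)]
        self.warm = [Copy() for _ in range(n_warm)]
        self.pending = []  # updates accumulated since the last WARM refresh

    def update(self, key, value):
        # HOT copies are updated immediately (atomicity is not modelled here).
        for copy in self.hot:
            copy.apply((key, value))
        self.pending.append((key, value))

    def refresh_warm(self):
        # Periodic step: ship the accumulated update list to the WARM copies.
        for copy in self.warm:
            for update in self.pending:
                copy.apply(update)
        self.pending.clear()

    def read(self, key, need_current: bool):
        # Reads needing the latest version go to a HOT copy; others may use WARM.
        source = self.hot[0] if need_current or not self.warm else self.warm[0]
        return source.data.get(key)


if __name__ == "__main__":
    f = ReplicatedFile(n_hot=2, n_warm=1)
    f.update("balance", 100)
    print(f.read("balance", need_current=True))   # served by a HOT copy
    print(f.read("balance", need_current=False))  # WARM copy, possibly stale
```

A read with need_current=False may return an older value until refresh_warm has run, which is the trade-off discussed in Section 2.3.2.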


2.3.2.  Previous  Work  in  Updating  Replicated  Files 
A  file  is  replicated 

(i)  in  order  that  its  availability  may  be  preserved  in  the  face  of  failures. 

(ii)  in  order  to  reduce  the  response  time  for  read  access.  If  a  file  copy  exists 
at the site of access or near it, the response time is less. The factors which
constrain the degree of replication are storage costs, update execution costs
and the response time for updates.

The  costs  involved  in  processing  an  update  result  from  startup  (parsing 
the  command,  authorization  checking,  setting  up  the  necessary  processes 
and  communication  ports),  concurrency  control  (obtaining  the  required 
locks),  retrieval  of  appropriate  data  from  storage,  computing  the  new  values 
and  writing  them  back  to  storage,  and  commit  processing. 

The  response  time  for  an  update  increases  with  the  number  of  sites  that 
must  be  accessed  before  a  ‘done’  can  be  returned  to  the  originator  of  the 
transaction. For example, in the 'performance' algorithms of distributed
INGRES  [STO  79],  only  one  specially  designated  copy,  called  the  primary,  is 
updated  before  the  ‘done’  is  signaled.  If  most  of  the  update  transactions  ori¬ 
ginate  at  the  primary,  then  the  response  time  for  most  update  transactions 
will  be  similar  to  that  obtained  if  the  file  existed  only  at  the  site  of  origin  of 
the  transaction.  However,  if  the  primary  fails  before  relaying  the  update  to 
the  remaining  copies,  the  update  is  ‘lost’  which  can  result  in  a  catastrophe. 
Therefore,  in  the  'reliability'  algorithms  of  distributed  INGRES,  all  copies  of 
the  file  are  updated  atomically  and  then  a  'done'  is  signaled.  This  results  in 
higher communication costs, a higher load on the resources of the sites
holding copies, as well as a higher response time.

Other  schemes  [THO  76,  GIF  79]  have  been  proposed  in  which  there  is  no 
designated  HOT  set  of  copies,  representing  the  latest  version  of  the  file,  at  a 
given  time.  Rather  the  HOT  set  may  'float'  from  update  to  update  even  when 
there  are  no  site  or  link  failures.  In  these  schemes,  a  majority  of  sites  hold¬ 
ing  copies  (  or  a  set  of  sites  holding  a  majority  of  votes  between  them  in 
Gifford's weighted voting scheme [GIF 79] ) must be accessed before any
update  can  complete,  hence  the  costs  per  update  and  response  time 
increase  with  the  number  of  copies. 

In  our  solution,  the  response  time  and  immediate  costs  per  update  do 
not  increase  if  the  number  of  WARM  copies  is  increased.  Moreover,  the 
updating  of  the  WARM  copies  can  be  scheduled  when  there  is  surplus  capacity 
in  the  sites  involved  and  in  the  communication  system.  Further,  there  is  no 
commit  processing  in  updating  WARM  copies  and  the  storage  accessing 
sequence  for  installing  a  batch  of  updates  can  be  more  efficient  than  if  they 
are  installed  separately.  Hence  the  costs  for  the  deferred  updating  of  a 
WARM  copy  are  less  than  for  the  immediate  updating  of  a  HOT  copy. 
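
To make the comparison concrete, the following Python sketch counts how many sites must be accessed before a 'done' can be returned under each of the schemes discussed. The function name and scheme labels are hypothetical, and the majority case assumes equal weights (one vote per copy); Gifford's weighted scheme would need the vote assignment as well.

```python
def sites_accessed_before_done(scheme: str, n_copies: int, p_hot: int = 1) -> int:
    """Number of copies that must be updated before 'done' is signaled."""
    if scheme == "primary":    # 'performance' algorithms: primary copy only
        return 1
    if scheme == "all":        # 'reliability' algorithms: every copy
        return n_copies
    if scheme == "majority":   # unweighted majority of the copies
        return n_copies // 2 + 1
    if scheme == "hot":        # this section: the P HOT copies only
        return p_hot
    raise ValueError(f"unknown scheme: {scheme}")


if __name__ == "__main__":
    for n in (3, 5, 9):
        row = {s: sites_accessed_before_done(s, n, p_hot=2)
               for s in ("primary", "all", "majority", "hot")}
        print(n, row)
```

Only the "hot" row stays constant as the number of copies grows, provided the additional copies are WARM.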

The  disadvantage  resulting  from  the  deferred  update  is  that  reading  a 
WARM  copy  may  not  give  the  latest  version  of  the  file.  However,  in  many 
cases,  the  latest  version  may  not  be  required  for  a  read  access.  For  exam¬ 
ple,  in  a  banking  application,  if  a  customer  has  just  made  a  withdrawal  of 
funds  and  makes  a  subsequent  query  about  his/her  balance,  the  withdrawal 
should  be  reflected  in  the  value  returned  in  answer  to  the  query.  Hence  a 
HOT  copy  should  be  used  to  answer  the  query.  But  a  transaction  computing 
the  sum-total  of  balances  of  all  customers  of  the  bank  will  not  require  the 
latest  value  of  each  balance,  and  can  make  use  of  a  WARM  copy  of  this  infor¬ 
mation.  In  other  situations,  if  an  old  version  is  obtained,  it  will  be  detected 
as  not  current  and  a  fresh  read  initiated.  File  catalogs  which  store  the 
whereabouts  of  files  in  the  network  are  an  example  of  such  a  case. 

An  important  component  of  any  scheme  for  managing  replication  is  the 
method of recovery. In our algorithm, a recovering site first obtains a WARM
copy and then obtains the additional updates (if any) to make it HOT if and
when the time comes for it to join the set of sites with HOT copies. If a HOT
copy  were  used  right  away,  the  file  would  be  locked  out  to  update  access  for 
the  entire  period  the  recovering  site  is  accessing  the  HOT  copy.  The  algo¬ 
rithms  of  [STO  79]  and  [BER  80]  make  use  of  a  reliable  queuing  mechanism  to 
buffer  updates  for  crashed  sites.  This  queuing  mechanism  achieves  reliabil¬ 
ity  by  storing  the  queued  messages  at  multiple  sites  in  the  network.  As  can 
be  seen  from  [HAM  80],  the  design  of  such  a  mechanism  can  be  quite  com¬ 
plex.  Besides,  if  a  site  is  down  for  a  long  time,  the  number  of  queued  mes¬ 
sages  for  it  may  be  so  large  that  storing  them  in  the  above  manner  may  be 
infeasible.  In  [GIF  79]  on  the  other  hand,  a  recovering  site  places  a  read  lock 
on  the  file  to  obtain  a  HOT  copy,  thus  preventing  update  access  for  the  dura¬ 
tion  this  process  is  occurring,  which  may  be  appreciably  long  in  a  long-haul 
network.  In  our  algorithm,  the  recovering  site  obtains  a  WARM  copy  from  a 
site  which  has  a  WARM  copy  if  possible  and  thus  avoids  locking  out  update 
access  on  the  HOT  copies.  Only  when  a  site  with  a  WARM  copy  is  entering  the 
set  of  sites  with  HOT  copies  is  a  read  lock  placed  on  a  HOT  copy.  Update 
access  to  the  file  is  prevented  while  the  lock  is  held.  However,  if  the  refresh¬ 
ing  interval  for  the  sites  with  WARM  copies  is  properly  chosen,  the  time  for 
obtaining  the  additional  updates  that  have  occurred  since  the  last  refresh 
will  be  small  compared  to  the  time  required  to  transfer  the  entire  file,  hence 
the  locking  out  period  will  also  be  comparatively  small. 

2.3.3. States and State Transitions

The  complete  state  diagram  for  a  site  holding  a  copy  of  the  file  is  shown 
in  Fig.  2.7.  In  the  sequel,  we  will  refer  to  sites  in  state  S  as  S  sites  where  S 
may be DEAD, COLD, WARM, HOT or PRIMARY.


A  site  is  in  DEAD  state  if  it  is  down  or  if  it  has  recovered  but  not  yet  ini¬ 
tiated  the  execution  of  software  for  file  recovery.  It  is  in  COLD  state  from  the 
moment  of  initiation  till  it  has  obtained  a  WARM  copy  of  the  file.  It  enters 
WARM  state  on  obtaining  a  WARM  copy.  While  in  the  WARM  state,  the  site  ser¬ 
vices  read  requests  that  do  not  necessarily  require  the  latest  version  of  the 
file.  If  it  does  not  crash,  eventually  it  enters  the  head  of  the  queue  of  WARM 
sites  waiting  to  join  the  set  of  HOT  sites.  When  the  next  failure  of  a  HOT  site 
occurs,  it  makes  itself  completely  up-to-date  and  enters  HOT  state.  While  in 
the  HOT  state,  the  file  copy  at  the  site  is  updated  atomically  with  other  HOT 
copies  when  an  update  request  is  received.  In  addition,  the  site  also  services 
read  requests,  which  thereby  obtain  the  latest  version  of  the  file.  The 
number  of  HOT  copies  is  maintained  at  P.  If  the  set  of  sites  holding  HOT 
copies  falls  below  P  in  strength,  update  requests  are  not  accepted  till  the  set 
regains  its  full  strength.  This  is  done  to  maintain  the  probability  of  ‘losing’ 
the  latest  version  of  the  file  below  a  given  level  determined  by  the  value  of  P. 
One  of  the  sites  holding  a  HOT  copy  is  designated  as  the  PRIMARY.  The  PRI¬ 
MARY  performs  two  tasks  in  addition  to  those  done  by  a  member  of  the  set  of 
HOT  sites.  First,  it  is  responsible  for  broadcasting  lists  of  accumulated 
updates  periodically  to  the  WARM  sites.  Second,  when  the  strength  of  the  set 
of  HOT  copies  falls  below  P,  it  is  responsible  for  helping  the  WARM  sites,  which 
join  the  HOT  set  as  a  result,  to  make  their  copies  HOT.  A  site  in  HOT  state 
enters  the  PRIMARY  state,  when  all  the  sites  that  were  in  the  HOT  or  PRIMARY 
states  when  it  entered  the  set  have  crashed.  In  order  to  have  at  most  one 
PRIMARY  and  at  most  P-1  HOT  sites  at  a  given  time,  the  sites  holding  copies 
of  the  file  form  themselves  in  a  queue  which  determines  their  priority  for 
entering the HOT state or becoming PRIMARY (Fig. 2.8). This queue is
ordered in increasing order of the times on the global clock that the sites
not in DEAD state entered the COLD state. The DEAD sites are at the rear of
the queue in arbitrary order. The queue is maintained at each site not in
DEAD state. The global clock allows these queues to be maintained in a
consistent yet autonomous manner, i.e. the queues are not updated atomically,
but still they permit the sites to regulate their entry into the HOT or
PRIMARY states on the basis of the local queue while maintaining the
constraint on the number of sites in the HOT state and the requirement of a
single PRIMARY. Previous algorithms for selecting primaries, e.g. those in
[STO 79, GAR 82], rely on some form of atomic updating of status information.
Therefore, they suffer from complications which arise if sites crash or
recover during the atomic update.

FIG. 2.8. CONFIGURATION OF SITES FOR UPDATING THE REPLICATED FILE
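
The state transitions described in this section can be summarized as a small table. The Python sketch below is inferred from the prose only (Fig. 2.7 may show transitions not captured here), and the names State, ALLOWED and transition are hypothetical.

```python
from enum import Enum


class State(Enum):
    DEAD = "DEAD"
    COLD = "COLD"
    WARM = "WARM"
    HOT = "HOT"
    PRIMARY = "PRIMARY"


# Transitions inferred from the text; a crash takes any live state to DEAD.
ALLOWED = {
    State.DEAD: {State.COLD},                  # file recovery software initiated
    State.COLD: {State.WARM, State.DEAD},      # WARM copy obtained
    State.WARM: {State.HOT, State.DEAD},       # HOT set depleted, site at head of queue
    State.HOT: {State.PRIMARY, State.DEAD},    # all earlier HOT/PRIMARY sites crashed
    State.PRIMARY: {State.DEAD},
}


def transition(current: State, target: State) -> State:
    """Return the new state, or raise if the move is not in the table."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target


if __name__ == "__main__":
    s = State.DEAD
    for nxt in (State.COLD, State.WARM, State.HOT, State.PRIMARY):
        s = transition(s, nxt)
        print(s.name)
```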

2.3.4.  The  ADA  Multitasking  Facility  and  Remote  Procedure  Calls 

In  the  appendix  of  this  chapter,  we  specify  our  algorithm  in  ADA.  The 
program  displayed  in  the  appendix  does  not  represent  an  existing  implemen¬ 
tation  of  the  algorithm,  but  is  intended  to  be  a  more  formal  specification 
than  the  informal  description  given  in  Section  2.3.6.  In  this  section,  we  out¬ 
line  the  multitasking  facility  of  ADA  and  a  remote  procedure  call  mechanism 
adequate  for  the  problem  at  hand. 

Consider the example task READER_WRITER taken from the ADA Reference
Manual [HON 79] (Fig. 2.9). The procedure READ and the entry WRITE in
the task declaration at the top can be called by other tasks. The entries
START and STOP declared in the task body can only be called within the task
body itself. The procedure READ can be executed on behalf of several tasks
simultaneously. But the entry WRITE is executed by the task READER_WRITER
in mutual exclusion and only when it reaches an accept statement for the
entry. The select statement allows the task to choose any one of several
alternatives. In the example, the accept statements for the entries START,
STOP and WRITE are the alternatives in the select statement. A condition may
be associated with an alternative and acts as a 'guard' which controls when
a called entry may be executed. For example, the condition READERS = 0
controls the execution of the entry WRITE. A delay statement (not used in
the given example) can be used as an alternative to allow the task to take
some action in case none of the other alternatives are executable over a
given period of time, either because there are no calls or because the
guards do not permit execution.

    task READER_WRITER is
       procedure READ (V : out ELEM);
       entry WRITE (E : in ELEM);
    end;

    task body READER_WRITER is
       RESOURCE : ELEM;
       READERS  : INTEGER := 0;
       entry START;
       entry STOP;

       procedure READ (V : out ELEM) is
       -- READ is a procedure, not an entry, hence concurrent calls of READ are possible
       -- READ synchronizes such calls with the entry calls START and STOP
       begin
          START; V := RESOURCE; STOP;
       end;

    begin
       accept WRITE (E : in ELEM) do
          RESOURCE := E;
       end;
       loop
          select
             accept START;
             READERS := READERS + 1;
          or
             accept STOP;
             READERS := READERS - 1;
          or when READERS = 0 =>
             accept WRITE (E : in ELEM) do
                RESOURCE := E;
             end WRITE;
          end select;
       end loop;
    end READER_WRITER;

FIG. 2.9. EXAMPLE TO ILLUSTRATE THE ADA MULTITASKING FACILITY [HON 79]

Tasks  are  started  through  an  initiate  statement.  If  an  entry  or  pro¬ 
cedure  is  called  when  the  task  containing  the  procedure  or  entry  is  inactive 
(  either  because  the  task  has  not  been  initiated  or  because  it  has  ter¬ 
minated  )  the  exception  TASKING_ERROR  is  raised  in  the  task  issuing  the  call. 

Although  communication  between  tasks  through  shared  variables  is  pos¬ 
sible  in  ADA,  our  program  makes  use  only  of  procedure  and  entry  calls  for 
communication.  Therefore  a  remote  call  facility  has  to  be  available  to  permit 
communication  between  tasks  at  different  sites. 

Apart from performance considerations, the most critical issue
concerning the design of the remote call facility is that of call semantics.
As a result of duplication of messages in the communication system or as a
result of retrying a call (when the return does not come in time or because
of crashes in the caller or callee), multiple executions of a procedure may
occur as a result of a single call. Nelson [NEL 81] considers the alternative
ways of dealing with this possibility. In at-least-once semantics, the results
obtained by the caller may be the ones obtained from any one of multiple
executions of the procedure caused by the call. In last-of-many semantics, the
results obtained correspond to the last of the multiple executions. Achieving
the latter is complicated by the occurrence of crashes, since a crashed site
can leave calls that continue to execute on other machines. Lampson
[LAM 81] calls executions whose return messages do not reach the calling
task orphans. In order to get last-of-many semantics, a crashed site on
recovery must exterminate its orphans before retrying the call or else adopt
them, i.e. use their results in completing the call rather than retrying
[NEL 81, LAM 81]. Neither option is inexpensive to implement.

Here  we  do  not  develop  a  generalized  remote  procedure  call  (RPC) 
mechanism  but  restrict  ourselves  to  outlining  a  simple  design  that  is  ade¬ 
quate  for  our  file  update  algorithm.  Some  simplifications  arise  in  the  case  of 
our  algorithm.  First,  all  remote  calls  are  to  entries  rather  than  procedures. 
Hence  nothing  has  to  be  done  to  serialize  the  multiple  executions  that  may 
be  caused  by  a  given  call.  Lampson's  solution  for  last-of-many  semantics 
given in [LAM 81] makes use of an extra unique id, which is assigned to every
call  and  is  included  in  all  retries  of  the  given  call,  to  effect  this  serialization. 
Second,  our  algorithm  does  not  assume  extermination  of  orphans  before  the 
recovery  of  a  crashed  site  is  begun,  hence  one  or  more  orphans  may  still  be 
executing  or  awaiting  execution  in  an  entry  queue  when  it  starts  its  recovery. 
The algorithm does not require any task to retry its calls after a crash. But
it  is  required  that  orphaned  executions  of  an  entry  terminate  before  a  new 
call  to  the  same  entry  by  the  recovered  site  is  started.  Our  requirements  for 
the  RPC  are  satisfied  by  the  following  design. 

Each  remote  call  is  converted  to  a  message  by  the  RPC  mechanism, 
timestamped  and  sent  to  the  destination  site.  A  timer  is  set  and  the  mes¬ 
sage  resent  with  a  new  timestamp  at  timer  runout  unless  one  of  the  following 
events occurs before the runout:

(i)  A  return  message,  bearing  the  timestamp  of  the  message  and  carrying  the 
results  of  the  entry  execution  caused  by  the  message,  is  received. 

(ii)  The  destination  site  is  marked  DOWN  at  the  calling  site.  (If  the  site  was 
already  marked  DOWN  at  the  time  the  call  was  received  from  the  calling  task, 
the  message  is  not  sent  at  all.) 

(iii)  A  message,  bearing  the  timestamp  of  the  message  sent,  is  received,  sig¬ 
nifying  that  the  task  containing  the  entry  is  not  active. 

In  the  first  case,  the  calling  task  resumes  execution  with  the  results  of 
the return message. In the latter two cases, the exception TASKING_ERROR
is  raised  in  the  calling  task.  The  message  is  resent  as  many  times  as  neces¬ 
sary  until  one  of  the  above  events  occurs  within  the  timeout  period. 
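
The caller's side of this design can be sketched as a retry loop. The Python fragment below is a sketch under stated assumptions, not the RPC mechanism itself: send, events_until_runout, read_clock and is_down are hypothetical hooks supplied by the surrounding system, events are modelled as dictionaries, and TaskingError stands in for the ADA exception TASKING_ERROR.

```python
class TaskingError(Exception):
    """Stands in for the ADA exception TASKING_ERROR."""


def remote_call(send, events_until_runout, read_clock, is_down, timeout=1.0):
    """Send a timestamped call and resend at each timer runout.

    events_until_runout(timeout) is assumed to yield the events that arrive
    before the timer runs out (possibly none).
    """
    if is_down():
        # Destination already marked DOWN: the message is not sent at all.
        raise TaskingError("destination marked DOWN")
    while True:
        stamp = read_clock()  # each (re)send carries a new timestamp
        send(stamp)
        for event in events_until_runout(timeout):
            if event["kind"] == "down":        # case (ii)
                raise TaskingError("destination marked DOWN")
            if event.get("stamp") != stamp:
                continue                       # reply to an earlier send: ignore
            if event["kind"] == "return":      # case (i)
                return event["result"]
            if event["kind"] == "inactive":    # case (iii)
                raise TaskingError("called task is not active")
        # Timer ran out with none of the three events: loop and resend.


if __name__ == "__main__":
    clock = iter([10, 20])
    batches = iter([[], [{"kind": "return", "stamp": 20, "result": "ok"}]])
    print(remote_call(lambda stamp: None, lambda t: next(batches),
                      lambda: next(clock), lambda: False))
```

Because a return message is accepted only if it bears the timestamp of the most recent send, the caller resumes with the results of the last execution it caused.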

When the RPC mechanism at site i is initiated after recovering from a
crash, it reads the global clock and stores the obtained value, TSTART. It
ignores all received call messages bearing timestamps less than TSTART. It
maintains a variable, MAX(j), for each site j which sends a remote call
message to site i with a timestamp greater than TSTART. MAX(j) always carries
the largest timestamp received in a call message from site j. A call message
from site j is ignored unless its timestamp exceeds MAX(j).

When a call message is received by site i which is not to be ignored for
either of the reasons cited above, MAX(j) is set to the timestamp of the
message and

(i) if the task containing the entry is active, the call is added to the queue for
the entry called. When a call completes, the results are sent in a message to
the calling site with the timestamp of the call message.

(ii) if the task is not active, a message is sent to the calling site, carrying the
timestamp of the call message and signifying that the task is not active.

Our  design  ensures  last-of-many  semantics  for  the  no-crash  case.  If  the 
caller  crashes,  it  does  not,  on  recovery,  retry  any  call  it  may  have  been  exe¬ 
cuting  at  the  time  of  the  failure.  Rather  the  task  is  simply  restarted.  Thus 
we  do  not  obtain  last-of-many  semantics  here.  However,  in  this  design  all  the 
orphaned  executions  of  a  given  entry  are  guaranteed  to  terminate  before  the 
entry  is  executed  as  a  result  of  a  call  made  by  the  recovered  site. 
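
The filtering performed at the called site can be sketched in a few lines of Python. The class name RpcReceiver and its methods are hypothetical; the treatment of a timestamp exactly equal to TSTART (accepted here) is an assumption, since the text only says that smaller timestamps are ignored.

```python
class RpcReceiver:
    """Callee-side filter on incoming call messages at one site."""

    def __init__(self, read_clock):
        self.tstart = read_clock()  # TSTART: clock value when the RPC layer starts
        self.max_seen = {}          # MAX(j): largest call timestamp seen from site j

    def accept(self, sender, stamp) -> bool:
        """Return True if the call message should be processed."""
        if stamp < self.tstart:
            return False            # sent before this site's recovery: ignore
        if sender in self.max_seen and stamp <= self.max_seen[sender]:
            return False            # duplicate or superseded message: ignore
        self.max_seen[sender] = stamp
        return True

    def handle(self, sender, stamp, task_active: bool):
        """Decide what to do with a call message; returns None if ignored."""
        if not self.accept(sender, stamp):
            return None
        if task_active:
            return ("queued", stamp)    # results later returned with this stamp
        return ("inactive", stamp)      # caller will raise TASKING_ERROR


if __name__ == "__main__":
    r = RpcReceiver(lambda: 100)
    print(r.handle("j", 90, True))    # ignored: older than TSTART
    print(r.handle("j", 110, True))   # queued
    print(r.handle("j", 110, True))   # ignored: does not exceed MAX(j)
```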

2.3.5.  Interface  to  the  Status  Maintenance  Mechanism 

Two  primitives  are  provided  by  the  site  maintenance  scheme  based  on 
the  global  clock  facility  described  in  Section  2.2. 

First,  the  function  READCLOCK  returns  the  current  value  of  the  global 
clock.  Second,  the  non-blocking  primitive  WATCHDOWN  can  be  invoked  on 
any  site  n  other  than  the  local  site.  The  site  status  maintenance  software 
invokes  a  designated  entry  in  the  task  invoking  the  watch,  immediately  if  the 
site  is  already  DOWN,  otherwise  when  the  site  is  next  marked  DOWN  in  the 
CRASH_OTHERS  graph.  The  id  of  the  site  on  which  the  watch  was  invoked 
and  the  current  reading  of  the  global  clock  are  supplied  as  parameters  when 
the  designated  entry  is  invoked.  If  WATCHDOWN  is  invoked  by  a  task  on  a  site 
when  it  already  has  a  watch  on  that  site  which  has  not  yet  returned,  the  invo¬ 
cation  has  no  effect. 
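
The behaviour of the two primitives can be illustrated by a small Python sketch. The class StatusMonitor and its method names are hypothetical; the status maintenance software of Section 2.2 is reduced here to a set of sites currently marked DOWN, and the designated entry is modelled as a callback taking the site id and the clock reading.

```python
class StatusMonitor:
    """Toy stand-in for the site status maintenance software at one site."""

    def __init__(self, read_clock):
        self.read_clock = read_clock  # plays the role of READCLOCK
        self.down = set()             # sites marked DOWN in CRASH_OTHERS
        self.watches = {}             # (task id, site) -> designated entry

    def watch_down(self, task_id, site, entry):
        """Non-blocking WATCHDOWN on `site` on behalf of `task_id`."""
        if (task_id, site) in self.watches:
            return                    # outstanding watch: invocation has no effect
        if site in self.down:
            entry(site, self.read_clock())  # already DOWN: invoke immediately
            return
        self.watches[(task_id, site)] = entry

    def mark_down(self, site):
        """Called when `site` is next marked DOWN; returns all watches on it."""
        self.down.add(site)
        now = self.read_clock()
        for key in [k for k in self.watches if k[1] == site]:
            self.watches.pop(key)(site, now)

    def mark_up(self, site):
        self.down.discard(site)


if __name__ == "__main__":
    monitor = StatusMonitor(lambda: 42)
    monitor.watch_down("recovery", "B", lambda s, t: print("DOWN:", s, "at", t))
    monitor.mark_down("B")
```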

2.3.6.  Description  of  the  Algorithm 

We  now  give  the  description  of  the  algorithm  and  the  motivation  behind 
the  steps  involved.  It  is  assumed  that  the  granularity  of  locking  is  the  entire 
file.  There  is  a  version-number  associated  with  the  file  which  is  incremented 
on each update. The N copies of the file are assumed to be at sites 1 through
N. The following is a description of the actions taken by each of the sites
bearing  a  file  copy  in  each  of  the  states  shown  in  Fig.  2.7. 

The following actions are undertaken by site i (1 ≤ i ≤ N) in state COLD:

(i)  read  the  local  clock  to  determine  the  time  of  entry  into  COLD  state  and 
store  it  in  the  variable  TREC(i). 

(ii)  broadcast  the  triad  (  local  id,  local  state,  TREC(i)  )  to  all  other  sites  which 
have  a  copy  of  the  file. 

(iii) invoke WATCHDOWN on all sites which have a file copy.

(iv)  send  requests  to  all  sites  which  have  file  copies  to  send  their  current 
state  values  together  with  the  times  at  which  they  entered  COLD  state  most 
recently. 

(v)  Wait  till,  for  each  of  these  sites,  the  required  information  has  been 
obtained  or  the  watch  on  it  has  returned. 

The status information obtained is assembled into a queue (in step (vi)
below) that is ordered in increasing order of the TREC values for sites which
have reported their status, while the DEAD sites, from which no report was
received, follow in any order after them. Consider any site j (≠ i) holding a
file copy. If the watch on j has returned, then clearly site j can re-enter
COLD state when it recovers only with TREC(j) > TREC(i). Therefore, when it
recovers and forms its queue, it will find that site i has a smaller TREC value
and will put itself behind site i in its queue. Also sites which have reported
larger TREC values will do the same. Thus the sites which site i will put
behind it in its queue in step (vi) will put themselves behind it in theirs,
and their queues will be consistent in this sense. Similarly, sites which have
lower TREC values than i, and which are therefore placed ahead of it by site
i, will put it after themselves in their queues.
The wait in step (v) will always terminate since for every site j ≠ i either

(a) site j will respond to the request in step (iv) or

(b) the WATCHDOWN placed on it at step (iii) at site i will return or

(c) if site j recovers from a crash before the WATCHDOWN on it is placed (so
that the WATCHDOWN does not return) but the status request from site i in
step (iv) arrives while the site has not initiated its file update software (so
that the request does not get a response), then site j will send the
necessary information when it itself executes step (ii).

(vi) For each site with a file copy, form the triad (site id, site state, time of
entry TREC into COLD state). For sites on which the WATCHDOWN has
returned, the site state is set to DEAD and the time of entry into COLD state,
TREC, is set to the time supplied by the site maintenance mechanism when it
returns the watch. For the others these variables are set to the values obtained
from their status reports. The triads are then ordered in a queue called
STATUS_Q according to increasing values of TREC for the non-DEAD sites, with
the triads for the DEAD sites following in arbitrary order. Moreover, from now
on until site i crashes, the status of these sites is updated as status
reports are received from them concerning their state transitions and as the
site status maintenance mechanism reports crashes among them (a WATCHDOWN
is always maintained on all file-copy-bearing sites other than i itself
which are recorded as not DEAD in STATUS_Q). After each update, the
queue is rearranged if necessary to reflect the ordering rules mentioned
above.

(vii) Initiate tasks to receive and buffer update lists broadcast by the
PRIMARY.

(viii) Start a task to maintain a list of received updates so that if and when
the site becomes PRIMARY it is able to carry out its responsibility of bringing
WARM sites up-to-date at regular intervals.

(ix) Wait till the first update list broadcast by the current PRIMARY arrives.

(x)  Obtain  the  lowest  version  number  V  corresponding  to  an  update  in  the 
first  update  list  received. 

(xi) Obtain a copy of the file warmer than V. For this purpose, a site marked
WARM in the STATUS_Q is used in preference to the sites marked HOT or
PRIMARY in order to avoid locking out the file to update access. The latter sites
are used if no site marked WARM is in STATUS_Q.
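The ordering rule of step (vi) can be illustrated with a small sketch. It is written in Python rather than in the notation of the appendix, and the names `Triad`, `order_status_queue` and `position` are hypothetical. Breaking ties on equal TREC values by site id is an assumption, since the text does not say how ties are resolved.

```python
from dataclasses import dataclass

DEAD, COLD, WARM, HOT, PRIMARY = "DEAD", "COLD", "WARM", "HOT", "PRIMARY"

@dataclass(frozen=True)
class Triad:
    site_id: int
    state: str
    trec: float  # time of the most recent entry into COLD state

def order_status_queue(triads):
    """Non-DEAD sites by increasing TREC, followed by the DEAD sites."""
    alive = sorted((t for t in triads if t.state != DEAD),
                   key=lambda t: (t.trec, t.site_id))  # tie-break is assumed
    dead = [t for t in triads if t.state == DEAD]      # any order is allowed
    return alive + dead

def position(queue, site_id):
    """1-based position of a site in the queue."""
    for pos, t in enumerate(queue, start=1):
        if t.site_id == site_id:
            return pos
    raise KeyError(site_id)

# Example: site 3 entered COLD state first, site 1 has crashed.
queue = order_status_queue([
    Triad(1, DEAD, 7.0),
    Triad(2, WARM, 5.0),
    Triad(3, PRIMARY, 1.0),
    Triad(4, HOT, 3.0),
])
```

With P = 2 in this example, site 2 sits in position 3 and so would remain WARM until a failure among the first two sites moves it forward.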

With  reference  to  step  (viii)  some  explanation  is  required.  A  site  i  must, 
from  the  time  it  begins  receiving  update  lists,  maintain  a  list  L  of  updates 
that  it  cannot  be  sure  have  reached  all  the  appropriate  sites,  i.e.  the  sites 
that  have  initiated  tasks  to  receive  update  lists  in  step  (vii)  above.  Let  X  be 
the  version  number  of  the  latest  update  it  knows  to  have  reached  all  the 
appropriate  sites.  (  When  it  receives  its  first  update  list,  X  is  set  to  one  less 
than  the  lowest  version  number  of  an  update  in  the  list.  This  is  because  the 
PRIMARY  always  finishes  sending  one  list  of  successive  updates  to  the 
appropriate  sites  before  starting  the  next  batch.  )  When  the  next  update  list 
arrives,  site  i  sets  X  to  max  (X,Y)  where  Y  is  the  number  one  less  than  the 
lowest  version  number  in  the  new  list  and  only  preserves  in  L  those  updates  it 
has received that have version numbers greater than X. In this way it
continues till it enters HOT state.

In the HOT state, site i itself takes part directly in every update, and
updates are added to L as soon as they are committed. When an update list
arrives from the current PRIMARY, then, if Y has the same significance as
above, all updates with version numbers up to Y are deleted, as they can be
inferred to have already been received by the appropriate sites.


In the PRIMARY state, updates continue to be directly added to L as they
are committed. Periodically all the updates in L are broadcast to the
appropriate sites and when the broadcast is complete, the updates that have
been broadcast are deleted from L.
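The bookkeeping for the list L across the three states can be sketched as follows. This is an illustrative Python model, not the appendix task itself: the class and method names are hypothetical, and update lists are assumed to be dictionaries mapping version numbers to updates.

```python
class PendingUpdates:
    """Sketch of the list L kept by one site (names are illustrative)."""

    def __init__(self):
        self.updates = {}   # version number -> update, i.e. the list L
        self.x = None       # X: latest version known to have reached all sites
        self.hot = False    # True once the site is HOT or PRIMARY

    def on_update_list(self, update_list):
        """Process an update list {version: update} sent by the PRIMARY."""
        y = min(update_list) - 1
        self.x = y if self.x is None else max(self.x, y)
        if not self.hot:
            # Before HOT state, updates reach the site only through these lists.
            self.updates.update(update_list)
        # Versions up to X are known to have reached all appropriate sites.
        self.updates = {v: u for v, u in self.updates.items() if v > self.x}

    def on_commit(self, version, update):
        """HOT or PRIMARY state: record an update as soon as it is committed."""
        self.updates[version] = update

    def snapshot_for_broadcast(self):
        """PRIMARY state: the batch of updates to broadcast next."""
        return dict(self.updates)

    def broadcast_complete(self, batch):
        """PRIMARY state: drop exactly the updates that were broadcast."""
        for version in batch:
            self.updates.pop(version, None)
```

Deleting only the versions in the completed batch, rather than clearing L, keeps any update committed while the broadcast was in progress.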

It is possible that no update list is ever received or, even if one is, that
step (xi) above cannot be completed. This can happen if there is a sequence
of failures such that only WARM or COLD sites, if any, are left in front of site i
in STATE_Q. In the latter case, the reason for not being able to complete
step (xi) will be that these sites had not been sent the update corresponding
to version number V (the lowest version number in the first update list
received) when the last HOT site failed, leaving them stranded. This means
not only that site i cannot enter WARM state, but that the site at the front of
STATE_Q cannot enter HOT state. This is a catastrophic failure since updates
can no longer be performed on the file, and hence it requires manual
intervention to reinitialize the system. The signal for manual intervention is
made by the WARM or COLD site that finds itself at the head of its STATE_Q.

After step (xi) above has been performed, site i has entered WARM
state. In this state its actions are:

(i)  Broadcast  the  news  of  the  state  transition  to  the  other  sites  having  file 
copies. 

(ii) Initiate tasks to do the following:

(a)  Respond  to  sites  wishing  to  acquire  WARM  copies  of  the  file 
so  that  they  can  enter  WARM  state. 

(b)  Consolidate  the  buffered  update  lists  with  the  local  copy  of 
the  file,  as  these  lists  become  available. 


(iii)  Wait  till  it  is  time  to  enter  HOT  state. 

This wait terminates when site i perceives itself to be in one of positions
1 through P in its STATE_Q. Typically this will happen when site i moves from
the (P+1)th position to the Pth position as a result of a failure of one of the
sites in the first P positions. But sometimes site i, when entering WARM state,
may already find itself in one of the first P positions because of a large
number of failures. When the wait terminates, the following steps are executed
prior to entering HOT state.

(iv) Obtain the id of the PRIMARY from the local STATE_Q.

(v) Request the PRIMARY to supply the current version number of the file.
The PRIMARY locks out update access on the file for a period of time,
expecting site i to complete its transition to HOT state in this time.

(vi)  After  getting  the  current  version  number,  wait  till  the  version  number  of 
its  local  copy  reaches  this  version  number  as  a  result  of  the  merging  of 
received  update  lists. 

(vii)  Inform  all  other  sites  with  file  copies  by  means  of  a  status  report  that  it 
is  entering  HOT  state. 

(viii) Perform a handshake with the site from which the version number
was obtained in step (v). If the handshake is successful, this means that no
updates have occurred since. Further, every site that can become PRIMARY
from now on until site i itself becomes PRIMARY or crashes has site i marked
as HOT in its STATE_Q. This ensures that site i participates in every
future atomic update of HOT copies until it crashes. This guarantee holds since
the update transaction co-ordinator, before committing an update, checks
with the current PRIMARY to make sure that all HOT sites have signified their
agreement to commit the update (see below). Hence site i can enter HOT
state. On the other hand, the handshake may not complete successfully.
This may happen for one of two reasons. The site which supplied the version
number in step (v) may have crashed. Alternatively, it may have removed the
readlock it had placed on its local file copy because site i did not perform the
handshake in the allotted period, in which case it will refuse to participate in
the handshake. In either case, the guarantee mentioned above cannot be
provided and hence site i must go back to step (iv).

It may happen that all the sites in front of site i crash before it can
complete the handshake. This is again a catastrophic failure since no site can
become HOT now without outside intervention. The same signaling mechanism
mentioned above comes into play, i.e. site i in WARM state finds itself at
the head of STATE_Q and therefore invokes manual intervention.

In  HOT  state,  site  i  performs  the  following  actions: 

(i)  Initiate  a  task  to  participate  in  performing  atomic  updates  in  co-operation 
with  the  other  HOT  sites,  the  PRIMARY  and  the  transaction  co-ordinator. 

(ii)  Wait  till  it  is  time  to  become  PRIMARY. 

The wait terminates when site i becomes the first in its STATE_Q. It then
enters PRIMARY state, in which it performs the following actions:

(i) participate in performing atomic updates in co-operation with the HOT
sites and the transaction co-ordinator.

(ii) periodically broadcast all the accumulated updates, completing one
broadcast before starting the next.

(iii) initiate a task to respond to requests from WARM sites for help in
entering HOT state. This task does the following. When a request for the current
version number is received from a WARM site, it sets a readlock on its local
file copy. It then obtains the local version number and returns it to the
requesting site, at the same time initiating a broadcast of an update list
carrying updates up to the current version number, so that the requesting
site can quickly make itself current. It waits for a given period of time for
the requesting site to perform a handshake signifying that it has accomplished
this and informed all the sites with file copies that it is entering
HOT state. If the handshake occurs within the given period, site i releases
the readlock after the handshake. Otherwise the readlock is released at the
end of the allotted period and site i refuses to do a handshake with the
requesting site, obliging it to start all over again by asking for the current
version number.
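The timed readlock and handshake on the PRIMARY side can be modelled roughly as below. This is a hedged Python sketch: the class name, the `lock_period` parameter and the explicit `now` arguments are assumptions made so that the example is deterministic. The update-list broadcast that accompanies the version reply is omitted.

```python
class VersionRequestHandler:
    """Sketch of the PRIMARY's help to WARM sites entering HOT state."""

    def __init__(self, lock_period):
        self.lock_period = lock_period  # allotted time for the handshake
        self.version = 0                # current version of the local copy
        self.pending = {}               # requesting site -> (version, deadline)

    def request_version(self, site, now):
        """Set a readlock until the deadline and return the current version."""
        self.pending[site] = (self.version, now + self.lock_period)
        return self.version

    def readlocked(self, now):
        """True while some requester's allotted period has not yet expired."""
        return any(now <= deadline for _, deadline in self.pending.values())

    def handshake(self, site, version, now):
        """Succeed only for a matching request made within the allotted period."""
        entry = self.pending.pop(site, None)
        if entry is None:
            return False                # no outstanding request from this site
        granted, deadline = entry
        return version == granted and now <= deadline
```

A requester whose handshake returns `False` has to call `request_version` again, which corresponds to going back to step (iv) of the WARM-state procedure.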

Lastly, we describe how the atomic update occurs. The co-ordinator of
the update transaction sends the updates to the sites it believes to be HOT
and to the site it thinks is the PRIMARY. (The site where the co-ordinator
resides can, when needed, obtain this information either from one of the sites
with file copies, or these latter can themselves broadcast their transitions
into HOT and PRIMARY states to the entire network.) On receiving the update,
a HOT site obtains a writelock on its local copy and responds 'ready'. The
PRIMARY obtains a writelock, and then gets the set of HOT sites from its
STATE_Q. If the number of HOT sites is not P-1, the PRIMARY releases the
writelock and rejects the update; otherwise it responds 'ready' and sends the
list of HOT sites along with its response.

The  co-ordinator  commits  the  transaction  if  and  only  if: 

(i) exactly one site sends a list of HOT sites along with its response, i.e.
only one site responds in PRIMARY state.

(ii)  all  the  sites  indicated  in  the  list  and  the  PRIMARY  respond  'ready'. 


Otherwise the transaction is aborted. On receiving the commit or abort
signal, the HOT sites and the PRIMARY perform or ignore the update
accordingly and release the writelock.
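The commit rule just stated can be written down compactly. The following Python sketch is illustrative only: the response format (a dictionary with `ready` and `hot_list` fields) and the function names are assumptions, and the reliable commit facility discussed next is not modelled.

```python
def primary_response(hot_sites, p):
    """PRIMARY's reply: 'ready' with the HOT list only if exactly P-1 HOT sites."""
    if len(hot_sites) != p - 1:
        return {"ready": False, "hot_list": None}   # update is rejected
    return {"ready": True, "hot_list": frozenset(hot_sites)}

def hot_response():
    """A HOT site's reply once it holds the writelock."""
    return {"ready": True, "hot_list": None}

def coordinator_commits(responses):
    """Decide commit/abort from the responses {site_id: response} received."""
    senders = [s for s, r in responses.items() if r["hot_list"] is not None]
    if len(senders) != 1:
        return False                 # condition (i): exactly one HOT list
    primary = senders[0]
    required = set(responses[primary]["hot_list"]) | {primary}
    # condition (ii): every listed HOT site and the PRIMARY responded 'ready'
    return all(s in responses and responses[s]["ready"] for s in required)
```

For example, with P = 3, PRIMARY site 1 and HOT sites 2 and 3, the transaction commits only if all three have replied 'ready'; a missing reply from site 3 leads to an abort.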

In addition, the co-ordinator must make use of a reliable commit facility
that ensures that even if the co-ordinator crashes at any time during the
transaction, the sites being updated all receive an abort or all receive a
commit signal. This can be done using commit backups [HAM 80]. This is to
ensure that sites being updated are not left holding the writelock, not
knowing whether to commit or abort the transaction.

The algorithm for the co-ordinator and its backups is given in [HAM 80]
and hence is not displayed in the appendix to this chapter.

2.3.7.  Choice  of  Parameters 

The parameters of the algorithm are TBROD, the refresh interval; N, the
total number of copies; and P, the number of HOT copies.

The  choice  of  TBROD  should  be  made  on  the  basis  of  how  up-to-date  the 
information  in  the  WARM  copies  is  required  to  be,  and  the  constraints  on  the 
buffer  space  in  which  updates  being  accumulated  for  broadcast  are  stored. 

The  value  of  N-P  will  depend  on  the  number  of  sites  where  the  frequency 
of  read  commands,  which  do  not  require  the  most  up-to-date  information,  is 
high.  It  can  be  quite  large  since  increasing  N  does  not  cause  a  penalty  to  be 
paid  in  the  response  time  and  immediate  processing  required  for  updates. 

The  value  of  P  depends  on  the  number  of  sites  where  the  frequency  of 
read  commands  which  do  require  the  latest  information  is  high,  and  on  the 
amount  of  protection  desired  against  the  possibility  of  the  catastrophic 
failure  in  which  no  HOT  copies  are  left  with  UP  sites.  This  failure  requires 
manual intervention to determine which sites have the most current version
and  re-initialize  the  system. 

We  show  below  a  rough  computation  to  determine  how  much  protection 
a  given  value  of  P  gives  against  catastrophic  failure. 

Given  the  value  of  TBROD  and  the  frequency  and  average  size  of  updates, 
we  can  compute  the  amount  of  data  that  must  flow  from  the  PRIMARY  to  a 
WARM  site  to  make  its  copy  HOT,  and  thence  the  amount  of  time  required. 
Assume that this time period is exponentially distributed with time constant
$1/\mu$. We assume that N is large enough that there is always a sufficiently
large number of sites which have been up long enough to become WARM.
Thus  when  the  set  of  HOT  copies  suffers  a  loss  of  one  or  more  copies,  the 
introduction  of  new  HOT  copies  is  not  delayed  by  the  non-availability  of  WARM 
copies. 

Assume that the period for which a site is UP is exponentially distributed
with time constant $1/\lambda$.

We  wish  to  find  the  expected  time  that  elapses  starting  from  a  state  in 
which  P  HOT  copies  exist  to  a  state  where  none  exist. 

Fig. 2.10 shows the state diagram with the state transitions. The updates
broadcast for the purpose of making a WARM copy HOT reach the other
WARM sites in parallel, and a large number of WARM sites is always
assumed present. Therefore when a WARM site in the process of making its
copy HOT fails, its place is just taken by the next WARM site in the queue and
the process of making its copy HOT continues where it was left off for the
crashed site. Therefore we need not concern ourselves with failures of WARM
sites. Note that because of our assumption of exponential distributions, the
random variable NP(t), the number of HOT copies at time t, is memoryless.

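From probability theory, it can be shown that for states NP = r,
r = 1, 2, ..., P-1, the time spent in the state, given the next state, has
the same distribution. Here $\lambda$ and $\mu$ denote the reciprocals of the
two time constants assumed above, i.e. the failure rate of an UP site and the
rate at which a WARM copy is made HOT. Then

$$T(r) = \frac{1}{r\lambda + \mu}$$

for r = 1, 2, ..., P-1, and

$$T(P) = \frac{1}{P\lambda}$$

where T(.) is the mean of the time spent in a state.

Further, the transition probabilities are

$$P(r, r-1) = \frac{r\lambda}{r\lambda + \mu}$$

for r = 1, 2, ..., P-1,

$$P(P, P-1) = 1$$

and

$$P(r, r+1) = \frac{\mu}{r\lambda + \mu}$$

for r = 1, 2, ..., P-1.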

Let $X_r$ be the mean time taken for the first transition from state NP = r
to state NP = 0. Then we get
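As a rough numerical sketch of such a computation (not a reproduction of the report's own derivation), write $D_r$ for the mean time of the first passage from NP = r to NP = r-1. First-step analysis on a chain with failure rate $r\lambda$ in state r and a single recovery rate $\mu$ gives $D_P = 1/(P\lambda)$ and $D_r = (1 + \mu D_{r+1})/(r\lambda)$ for r < P, and the mean time from P HOT copies to none is the sum of the $D_r$. The function name and the sample rates below are illustrative assumptions.

```python
def mean_time_to_catastrophe(p, lam, mu):
    """Mean time from NP = p to NP = 0 for the birth-death chain sketched above.

    lam: failure rate of one UP site (1 / mean UP period)
    mu:  rate at which a WARM copy is made HOT (1 / mean transfer time)
    """
    if p < 1 or lam <= 0 or mu < 0:
        raise ValueError("need p >= 1, lam > 0, mu >= 0")
    d = 1.0 / (p * lam)          # D_P: only failures are possible in state P
    total = d
    for r in range(p - 1, 0, -1):
        d = (1.0 + mu * d) / (r * lam)   # D_r from D_{r+1}
        total += d
    return total

# Illustrative rates only: one failure per unit time, two recoveries per unit time.
for copies in (1, 2, 3):
    print(copies, mean_time_to_catastrophe(copies, lam=1.0, mu=2.0))
```

Under these assumptions, the protection bought by each additional HOT copy grows with the ratio $\mu/\lambda$, i.e. with how quickly a WARM copy can be made HOT relative to how often sites fail.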
bors, rather than its response (or lack thereof) to a probe message from an
arbitrary point in the network. This prevents sites from being mistakenly
marked as DOWN when there is a routing failure, or when the sites are heavily
loaded and slow to respond, etc.

The  overhead  caused  by  the  scheme  is  of  the  same  order  as  the  new 
Arpanet  routing  method.  In  fact  some  of  the  processing  is  common  and  can 
be  merged. 

Based  on  this  status  maintenance  scheme,  we  developed  a  method  for 
updating  a  replicated  file.  The  use  of  the  status  maintenance  scheme  allows 
the  sites  to  perform  reconfiguration  actions  (e.g.  to  take  over  the  functions 
of  a  crashed  site)  independently  rather  than  making  the  reconfiguration 
decisions  collectively,  with  all  the  file-copy-bearing  sites  taking  part.  The 
method  allows  read  access  to  be  performed  inexpensively  when  it  is  not 
necessary  to  obtain  the  latest  information,  through  the  use  of  WARM  copies. 
The addition of WARM copies does not cause the performance of update
transactions to deteriorate.
The total number of sites carrying a copy of the file is N. Further,
assume that the total number of HOT sites plus the PRIMARY is to be
maintained at P. Below we specify the package COMMON and a set of task families
each  having  N  members,  one  for  each  site  bearing  a  file  copy.  The  program 
for  each  such  site  consists  of  the  package  COMMON  and  one  member  of  each 
task  family. 


package COMMON is

   type SITE_STATE is (DEAD, COLD, WARM, HOT, PRIMARY);
   type SITE_ID is INTEGER range 1..N;
   type COPY_SET is array(1..N) of BOOLEAN;
   type VERSION_NUMBER is 0..SYSTEM'MAX_INT;
   type TRIAD is record
         NO: SITE_ID;
         STATE: SITE_STATE;
         TR: TIME;
      end record; --this record type is used to transmit
                  --and store site status.
   type UPDATE_PACKAGE is record
         LVN: VERSION_NUMBER;
         HVN: VERSION_NUMBER;
         UPDATES: array (LVN..HVN) of UPDATE;
      end record; --used in broadcasting update-lists
                  --to WARM sites.

   function GET_MY_ID return TASK_ID; --returns the id of
                                      --the calling task.

end COMMON;


First  we  give  the  task  family  declarations. 

task FILE_RECOVER(INTEGER range 1..N);
--this task initiates the other local tasks and co-ordinates
--the entire recovery process.

task STATUS_REPORT_SENDER(INTEGER range 1..N) is
   --this task sends the site status in response to requests.
   entry STATUS_REQUEST(NOD: SITE_ID);
end STATUS_REPORT_SENDER;

task STATUS_REPORT_RECEIVER(INTEGER range 1..N) is
   --this task receives status reports from other sites and
   --calls another task to update the
   --locally stored status information accordingly.
   entry STATUS_REPORT(T: TRIAD);
end STATUS_REPORT_RECEIVER;

task SITE_CRASH_DETECTOR(INTEGER range 1..N) is
   --this task places WATCHDOWNs on the sites bearing file copies
   --which are up and calls another task to update the locally
   --stored status information when a watch returns.
   entry SITEUP(NOD: SITE_ID);
   entry WATCH_DOWN_INTERRUPT(NOD: SITE_ID; T: TIME);
end SITE_CRASH_DETECTOR;


task UPDATE_RECEIVER(INTEGER range 1..N) is
   --this task receives update-lists from whichever site is
   --currently PRIMARY till the site in which the task resides
   --itself enters PRIMARY state.
   entry UPDATE_LIST(UP: UPDATE_PACKAGE);
   entry ENTERING_HOT;
   entry QUIT;
end UPDATE_RECEIVER;

task COPY_STATUS_KEEPER(INTEGER range 1..N) is
   --this task maintains the status of each site bearing a file
   --copy in a queue hereafter referred to as STATUS_Q.
   entry GET_PRIMARY_ID(NOD: out SITE_ID);
   entry WAIT_TO_BECOME_PRIMARY;
   entry WAIT_TO_ENTER_HOT;
   entry WAIT_TILL_INIT;
   entry GET_HOTLIST(S: out COPY_SET);
   entry GET_STATUS(NOD: SITE_ID; STATE: out SITE_STATE; T: out TIME);
   entry UPDATE(T: TRIAD);
end COPY_STATUS_KEEPER;


task UPDATE_COLLECTOR(INTEGER range 1..N) is
   --this task performs the buffering for update_lists received
   --from the PRIMARY till they are integrated into the local
   --copy of the file.
   entry ADD_TO_LIST(UP: UPDATE_PACKAGE);
   entry GET_FIRST_VERSION_NUMBER(VN: out VERSION_NUMBER);
   entry GET_FROM_LIST(UPA: out access UPDATE_PACKAGE);
   entry QUIT;
end UPDATE_COLLECTOR;

task READER_WRITER(INTEGER range 1..N) is
   --this task performs read and update operations on the local file
   --copy and provides the synchronization through locks.
   entry INITIALIZE; --creates an empty file with version number 0.
   entry GET_READLOCK(TID: TASK_ID);
      --callable for a read request from a transaction only
      --after the site has entered WARM state.
   entry GET_WRITELOCK(TID: TASK_ID);
      --as above for a write transaction but in addition the task
      --FILE_RECOVER calls this entry in procedure FILE_TRANSFER
      --to write a WARM version into the initialized file when the
      --site holding this task is in COLD state.
   entry RELEASE_READLOCK(TID: TASK_ID);
   entry RELEASE_WRITELOCK(TID: TASK_ID);
   entry READ(...);
   entry WRITE(U: UPDATE; V: VERSION_NUMBER; FV: out VERSION_NUMBER);
      --if the current version number is 0, the update is performed
      --and the value of V is returned in FV. If the current version
      --number is not 0 and if V is not equal to one more than the
      --current version number, the procedure simply returns with
      --the current version number in FV. Else, the update is
      --performed and the version number incremented by one, and
      --the procedure returns the new version number in FV.
   entry GET_VERSION_NUMBER(V: out VERSION_NUMBER);
end READER_WRITER;


task LATEST_VERSION_NO_REQ_HANDLER(INTEGER range 1..N) is
   --this task executes when the local copy is in PRIMARY state
   --and provides the latest version number of the file to any
   --site trying to enter HOT state.
   entry HOT_VERSION_NO_REQ(NO: SITE_ID; V: out VERSION_NUMBER);
   entry HANDSHAKE(NOD: SITE_ID; V: VERSION_NUMBER;
                   ST: out (SUCCESS, FAILURE));
end LATEST_VERSION_NO_REQ_HANDLER;

task UPDATES_CONSOLIDATOR(INTEGER range 1..N) is
   --this task does the merging of buffered update_lists
   --into the local file copy when in WARM state.
   procedure WAKE_AT(V: VERSION_NUMBER);
   entry WAKEUP;
   entry QUIT;
end UPDATES_CONSOLIDATOR;

task UPDATE_HANDLER(INTEGER range 1..N) is
   --this task along with peer tasks in the PRIMARY and
   --HOT sites and the transaction coordinator atomically
   --updates the HOT and PRIMARY copies.
   entry UPDATE(U: UPDATE);
   entry BECOMING_PRIMARY;
end UPDATE_HANDLER;


task UPDATE_LIST_BROADCASTER(INTEGER range 1..N) is
   --this task does the broadcasting of update-lists
   --in PRIMARY state.
   entry PICKUP_PACKAGE(UP: UPDATE_PACKAGE);
end UPDATE_LIST_BROADCASTER;

task BROADCAST_TIMER(INTEGER range 1..N);
--this task designates the times for the periodic update-list
--broadcasts.

task UPDATES_TO_BE_BROADCAST_MAINTAINER(INTEGER range 1..N) is
   --from the time that the site holding this task starts receiving
   --update-lists, this task maintains a list of updates that it is
   --not certain have been received by all appropriate sites; it
   --provides the update-lists to be broadcast in PRIMARY state.
   entry REPLACE(UP: UPDATE_PACKAGE);
   entry ENTERING_HOT;
   entry DELETE_UPTO(VN: VERSION_NUMBER);
   entry PREPARE_PACKAGE;
   entry ADD_UPD(U: UPDATE; V: VERSION_NUMBER);
   entry BROADCAST_ON_REACHING(V: VERSION_NUMBER);
end UPDATES_TO_BE_BROADCAST_MAINTAINER;


Next  we  specify  the  task  bodies. 
task body FILE_RECOVER is

   LNI: constant := FILE_RECOVER'INDEX;
   TREC: TIME; --set once, on entry into COLD state.
   LOCAL_STATE: SITE_STATE;

   procedure BROADCAST_STATE(S: SITE_STATE) is
   --used to broadcast state transitions.
   begin
      for I in SITE_ID loop
         if LNI /= I then
            begin
               STATUS_REPORT_RECEIVER(I).STATUS_REPORT((LNI, S, TREC));
            exception
               when TASKING_ERROR => null; --ignore exception if call
                                           --does not complete
            end;
         end if;
      end loop;
   end BROADCAST_STATE;

   procedure FILE_TRANSFER(V: VERSION_NUMBER) is
   --This procedure obtains a copy of the file 'warmer' than V.
   --It uses a WARM site in preference to a HOT or the
   --PRIMARY site to avoid interference in updating them.
   --It will not return if all the sites which received
   --the update corresponding to version number V and are ahead
   --of the caller in STATE_Q fail before the transfer completes.
   --In this case, some site not in HOT or PRIMARY state will find
   --itself at the head of STATE_Q and invoke manual intervention.

   end FILE_TRANSFER;


begin
   initiate COPY_STATUS_KEEPER(LNI);
   initiate STATUS_REPORT_RECEIVER(LNI);
   initiate READER_WRITER(LNI);
   LOCAL_STATE := COLD;
   TREC := READCLOCK;
   COPY_STATUS_KEEPER(LNI).UPDATE((LNI, LOCAL_STATE, TREC));
      --initialize the queue element for the site
      --in which the task resides, in STATUS_Q.
   initiate STATUS_REPORT_SENDER(LNI);
      --after the above initialization, status
      --requests can be replied to.
   BROADCAST_STATE(COLD); --inform all file-copy-bearing sites of
                          --local status.
   initiate SITE_CRASH_DETECTOR(LNI); --this task initially places
                                      --WATCHDOWNs on all other
                                      --file-copy-bearing sites.
   for NOD in SITE_ID loop
      if NOD /= LNI then
         STATUS_REPORT_SENDER(NOD).STATUS_REQUEST(LNI);
      end if; --to all other file-copy-bearing sites
   end loop;  --send status requests.
   COPY_STATUS_KEEPER(LNI).WAIT_TILL_INIT;
      --at this point the state of all sites is initialized in
      --STATE_Q; either WATCHDOWN on them has returned or they
      --have returned status reports.

   declare --in this block a WARM copy is obtained.
      INV: VERSION_NUMBER;
   begin
      initiate UPDATE_COLLECTOR(LNI);
      initiate UPDATES_TO_BE_BROADCAST_MAINTAINER(LNI);
      initiate UPDATE_RECEIVER(LNI);
      UPDATE_COLLECTOR(LNI).GET_FIRST_VERSION_NUMBER(INV);
      FILE_TRANSFER(INV);
   end;

   LOCAL_STATE := WARM;
   BROADCAST_STATE(WARM); --inform all other sites of
                          --the state transition.
   COPY_STATUS_KEEPER(LNI).UPDATE((LNI, LOCAL_STATE, TREC));
      --update STATE_Q.
   initiate UPDATES_CONSOLIDATOR(LNI);
   COPY_STATUS_KEEPER(LNI).WAIT_TO_ENTER_HOT;
      --wait terminates when the
      --site holding the task enters one of the
      --first P positions in STATE_Q.

   declare --in this block the site makes its file copy
           --correspond with the latest version.
      P: SITE_ID;
      BDONE: BOOLEAN := FALSE;
      DONE, RES1, RES2: (SUCCESS, FAILURE) := FAILURE;
      HOT_VN: VERSION_NUMBER;
   begin
      while RES1 = FAILURE loop
         --loop till successful handshake occurs.
         RES2 := FAILURE; --after a failed handshake the latest
                          --version number must be requested again.
         while RES2 = FAILURE loop --loop till latest version
                                   --number is obtained.
            begin
               COPY_STATUS_KEEPER(LNI).GET_PRIMARY_ID(P);
               LATEST_VERSION_NO_REQ_HANDLER(P).
                  HOT_VERSION_NO_REQ(LNI, HOT_VN);
               RES2 := SUCCESS;
            exception
               when TASKING_ERROR => null; --continue inner loop.
            end;
         end loop;
         UPDATES_CONSOLIDATOR(LNI).WAKE_AT(HOT_VN);
            --wait terminates when the version number reaches
            --HOT_VN.
         if not BDONE then BROADCAST_STATE(HOT); BDONE := TRUE; end if;
            --inform sites ahead in STATE_Q that the site is
            --entering HOT state if this has not been done in
            --previous iterations of the loop.
         begin
            LATEST_VERSION_NO_REQ_HANDLER(P).
               HANDSHAKE(LNI, HOT_VN, DONE);
            if DONE = SUCCESS then
               RES1 := SUCCESS; --handshake terminates successfully.
            else
               RES1 := FAILURE; --handshake fails.
            end if;
         exception
            when TASKING_ERROR =>
               --the primary has failed since
               --it supplied the latest version number.
               RES1 := FAILURE;
         end;
      end loop;
      UPDATE_COLLECTOR(LNI).QUIT;
      UPDATES_CONSOLIDATOR(LNI).QUIT;
      UPDATE_RECEIVER(LNI).ENTERING_HOT;
      UPDATES_TO_BE_BROADCAST_MAINTAINER(LNI).ENTERING_HOT;
   end;

   LOCAL_STATE := HOT;
   COPY_STATUS_KEEPER(LNI).
      UPDATE((LNI, LOCAL_STATE, TREC)); --update STATE_Q.
   initiate UPDATE_HANDLER(LNI);
   COPY_STATUS_KEEPER(LNI).WAIT_TO_BECOME_PRIMARY;
   UPDATE_RECEIVER(LNI).QUIT;
   UPDATE_HANDLER(LNI).BECOMING_PRIMARY;
   initiate LATEST_VERSION_NO_REQ_HANDLER(LNI);
      --respond to requests for the latest version number
      --from sites wanting to enter HOT state.
   initiate UPDATE_LIST_BROADCASTER(LNI);
      --commence periodic broadcasts of update lists.
   initiate BROADCAST_TIMER(LNI);
   LOCAL_STATE := PRIMARY;
   COPY_STATUS_KEEPER(LNI).
      UPDATE((LNI, LOCAL_STATE, TREC)); --update STATE_Q.
   BROADCAST_STATE(PRIMARY); --inform all other file-copy-bearing
                             --sites of state transition.

end FILE_RECOVER;


task body STATUS_REPORT_SENDER is
   LNI: constant SITE_ID := STATUS_REPORT_SENDER'INDEX;
   S: SITE_STATE;
   T: TIME;
begin
   loop
      accept STATUS_REQUEST(NOD: SITE_ID);
      COPY_STATUS_KEEPER(LNI).GET_STATUS(LNI, S, T);
      begin
         STATUS_REPORT_RECEIVER(NOD).STATUS_REPORT((LNI, S, T));
      exception
         when TASKING_ERROR => null;
      end;
   end loop;
end STATUS_REPORT_SENDER;


task body STATUS_REPORT_RECEIVER is
   LNI: constant SITE_ID := STATUS_REPORT_RECEIVER'INDEX;
      --index of the local site, used in the calls below.
begin
   loop
      accept STATUS_REPORT(T: TRIAD) do
         if (T.STATE = COLD) then
            SITE_CRASH_DETECTOR(LNI).SITEUP(T.NO);
               --T.NO is back up, so a WATCHDOWN should
               --be placed on it.
         end if;
         COPY_STATUS_KEEPER(LNI).UPDATE(T); --update STATE_Q.
      end STATUS_REPORT;
   end loop;
end STATUS_REPORT_RECEIVER;


task body SITE_CRASH_DETECTOR is
   LNI: constant SITE_ID:=SITE_CRASH_DETECTOR'INDEX;
begin
   for NOD in SITE_ID loop --place WATCHDOWN on file copy
                           --bearing sites.
      if (NOD /= LNI) then
         WATCHDOWN(NOD);
      end if;
   end loop;
   loop
      select
         accept WATCH_DOWN_INTERRUPT(NOD: SITE_ID; T: TIME);
         --this entry is invoked by the status
         --maintenance scheme when a watch returns.
         COPY_STATUS_KEEPER(LNI).UPDATE((NOD,DEAD,T));
      or
         accept SITEUP(NOD: SITE_ID);
         WATCHDOWN(NOD);
      end select;
   end loop;
end SITE_CRASH_DETECTOR;


task body UPDATE_RECEIVER is
   HOT: BOOLEAN:=FALSE;
   LNI: constant SITE_ID:=UPDATE_RECEIVER'INDEX;
begin
   loop
      select
         when not HOT =>
            accept UPDATE_LIST(UP: UPDATE_PACKAGE);
            UPDATES_TO_BE_BROADCAST_MAINTAINER(LNI).REPLACE(UP);
            --receipt of this list may modify the updates that
            --are known to have been broadcast.
            UPDATE_COLLECTOR(LNI).ADD_TO_LIST(UP);
            --add this list to buffered update-lists.
      or
         accept ENTERING_HOT;
         HOT:=TRUE;
      or
         when HOT =>
            accept UPDATE_LIST(UP: UPDATE_PACKAGE);
            UPDATES_TO_BE_BROADCAST_MAINTAINER(LNI).
               DELETE_UPTO(UP.LVN-1);
            --updates upto version number UP.LVN-1 must have
            --already been broadcast.
      or
         accept QUIT; --PRIMARY state is being entered, from now on
                      --the site holding this task will do the
                      --broadcasting of update-lists.
         exit;
      end select;
   end loop;
end UPDATE_RECEIVER;


task body UPDATE_COLLECTOR is
   LNI: constant SITE_ID:=UPDATE_COLLECTOR'INDEX;
   type LIST_ELEM is record
      LUP: UPDATE_PACKAGE;
      SUCC: access LIST_ELEM;
   end record; --buffer for update-list.
   HP, TP: access LIST_ELEM;
   MIN_VN, MAX_VN: VERSION_NUMBER;
   --MIN_VN is the first update received after
   --entry into COLD-state; MAX_VN is the latest
   --update received.
   WAITING: BOOLEAN:=FALSE; --signifies when TRUE that the task
                            --UPDATES_CONSOLIDATOR is waiting for
                            --the next update list.
begin
   accept ADD_TO_LIST(UP: UPDATE_PACKAGE);
   TP:=new LIST_ELEM(LUP=>UP, SUCC=>null);
   HP:=TP;
   MIN_VN:=UP.LVN; MAX_VN:=UP.HVN;
   loop
      select
         accept GET_FIRST_VERSION_NUMBER(VN: out VERSION_NUMBER) do
            VN:=MIN_VN;
         end GET_FIRST_VERSION_NUMBER;
      or
         accept ADD_TO_LIST(UP: UPDATE_PACKAGE);
         if (UP.HVN>MAX_VN) then --if the new list contains updates
                                 --not already received.
            if (HP=null) then --if all received updates have already
                              --been picked up by UPDATES_CONSOLIDATOR.
               TP:=new LIST_ELEM(LUP=>UP, SUCC=>null);
               HP:=TP;
            else
               TP.SUCC:=new LIST_ELEM(LUP=>UP, SUCC=>null);
               TP:=TP.SUCC;
            end if;
            MAX_VN:=UP.HVN;
            if WAITING then
               UPDATES_CONSOLIDATOR(LNI).WAKEUP;
               WAITING:=FALSE; --one WAKEUP for each time GET_FROM_LIST
                               --returns zero updates.
            end if;
         end if;
      or
         accept GET_FROM_LIST(UPA: out access UPDATE_PACKAGE) do
            if (HP=null) then --if no updates on hand
               UPA:=new UPDATE_PACKAGE(LVN=>0, HVN=>0);
               --return a null list.
               WAITING:=TRUE; --remember that UPDATES_CONSOLIDATOR will
                              --be waiting to be informed when some
                              --updates are available.
            else
               UPA:=new UPDATE_PACKAGE(HP.LUP);
               HP:=HP.SUCC;
            end if;
         end GET_FROM_LIST;
      or
         accept QUIT; exit;
         --the site holding this task is entering
         --HOT state.
      end select;
   end loop;
end UPDATE_COLLECTOR;


task body UPDATES_CONSOLIDATOR is
   TID: constant TASK_ID:=COMMON.GET_MY_ID;
   WAITING: BOOLEAN:=FALSE; --when TRUE, this variable signifies that
                            --this task is waiting for more updates
                            --to arrive.
   P: access UPDATE_PACKAGE;
   CVN: VERSION_NUMBER; --current version number of the local copy.
   TRAP_VN: VERSION_NUMBER;
   TRAP_SET: BOOLEAN:=FALSE;
   LNI: constant SITE_ID:=UPDATES_CONSOLIDATOR'INDEX;
   entry SET_TRAP(V: VERSION_NUMBER);
   entry REACHED;

   procedure WAKE_AT(V: VERSION_NUMBER) is
   --is used by FILE_RECOVER to be informed when the local copy
   --reaches the latest version number, so that it can enter
   --HOT state.
   begin
      SET_TRAP(V);
      REACHED; --the latest version number has been reached.
   end;
begin
   READER_WRITER(LNI).GET_VERSION_NUMBER(CVN);
   <<OUTER>>
   loop
      while not WAITING loop --loop till all update lists received
                             --have been merged into the local copy.
         UPDATE_COLLECTOR(LNI).GET_FROM_LIST(P);
         if (P.HVN=0) then --if a null list has been obtained
            WAITING:=TRUE; --then wait till some updates arrive.
         else
            READER_WRITER(LNI).GET_WRITELOCK(TID);
            for J in P.LVN..P.HVN loop --merge updates.
               if J=CVN+1 then
                  READER_WRITER(LNI).WRITE(P.UPDATES(J),J,CVN);
               end if;
            end loop;
            READER_WRITER(LNI).RELEASE_WRITELOCK(TID);
         end if;
      end loop;
      <<INNER>>
      loop
         select
            accept SET_TRAP(V: VERSION_NUMBER);
            TRAP_VN:=V;
            TRAP_SET:=TRUE;
         or
            when (TRAP_SET and then CVN>=TRAP_VN) =>
               --version number has reached the latest version number;
               accept REACHED;
         or
            accept WAKEUP; WAITING:=FALSE; exit INNER;
            --go to process the newly arrived updates.
         or
            accept QUIT; exit OUTER;
            --site is entering HOT state.
         end select;
      end loop;
   end loop;
end UPDATES_CONSOLIDATOR;
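
The merge loop in UPDATES_CONSOLIDATOR applies an update only when its
version number is exactly one more than the current version of the local
copy, so updates that have already been applied (update lists may overlap)
are skipped. A minimal Python sketch of that rule follows. The names
UpdatePackage and merge_package, and the list standing in for the file
copy, are illustrative assumptions and do not appear in the listing.

```python
from dataclasses import dataclass, field


@dataclass
class UpdatePackage:
    """Hypothetical stand-in for UPDATE_PACKAGE: updates numbered lvn..hvn."""
    lvn: int
    hvn: int
    updates: dict = field(default_factory=dict)  # version number -> update


def merge_package(copy: list, cvn: int, pkg: UpdatePackage) -> int:
    """Apply, in order, each update whose version is exactly cvn + 1.

    `copy` stands in for the local file copy; the new current version
    number is returned.
    """
    for j in range(pkg.lvn, pkg.hvn + 1):
        if j == cvn + 1:              # next expected version: apply it
            copy.append(pkg.updates[j])
            cvn = j
        # any other version is already applied or out of sequence
    return cvn
```

With the local copy at version 2, a package covering versions 1..4
contributes only updates 3 and 4, while a package that starts beyond
version cvn + 1 contributes nothing until the gap is filled.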


task body COPY_STATUS_KEEPER is
   type SITE_REC is
      record
         T: TRIAD;
         PRED, SUCC: access SITE_REC;
      end record;
   HOTLIST: COPY_SET:=(1..N=>FALSE);
   SITES_NOT_INITIALIZED: COPY_SET:=(1..N=>TRUE);
   INIT: BOOLEAN:=FALSE;
   ENTER_HOT: BOOLEAN:=FALSE;
   BECOME_PRIMARY: BOOLEAN:=FALSE;
   PRIMARY_KNOWN: BOOLEAN:=FALSE;
   HP, TP: access SITE_REC:=null; --head and tail pointers to STATE_Q.
   LOCAL_POS: INTEGER range 1..N; --position of site in which task
                                  --resides in STATE_Q.
   LNI: constant SITE_ID:=COPY_STATUS_KEEPER'INDEX;

   procedure RETRIEVE(N: SITE_ID; S: out SITE_STATE; T: out TIME) is
   --This procedure searches the queue to find the element for
   --the site corresponding to N and returns the values of
   --T.STATE and T.TR.
   end RETRIEVE;

   procedure MODIFY(T: TRIAD) is
   --This procedure changes the queue-element for the site
   --specified in T to the values specified in T provided T.TR is
   --greater than or equal to the corresponding component of
   --the queue element; it then moves the element if necessary
   --so that STATE_Q is still sorted in increasing order of
   --the value of this component for non-DEAD sites with the DEAD
   --sites following in any order.
   end MODIFY;

   procedure GET_LOCAL_POS(POS: out INTEGER) is
   --This procedure gets the position of the queue-element
   --corresponding to site LNI, measured from the head of the
   --queue, i.e. if site LNI were at the head, the procedure will
   --return with POS=1.
   end GET_LOCAL_POS;
begin
   for NOD in SITE_ID loop --form STATE_Q.
      if (TP=null) then
         TP:=new SITE_REC(T=>(NOD,DEAD,0), PRED=>null, SUCC=>null);
         HP:=TP;
      else
         TP.SUCC:=
            new SITE_REC(T=>(NOD,DEAD,0), PRED=>TP, SUCC=>null);
         TP:=TP.SUCC;
      end if;
   end loop;
   LOCAL_POS:=LNI;
   loop
      select
         when PRIMARY_KNOWN =>
            accept GET_PRIMARY_ID(NOD: out SITE_ID) do
               NOD:=HP.T.NO; --PRIMARY is at the head of STATE_Q.
            end GET_PRIMARY_ID;
      or
         when BECOME_PRIMARY =>
            accept WAIT_TO_BECOME_PRIMARY;
      or
         when ENTER_HOT =>
            accept WAIT_TO_ENTER_HOT;
      or
         when INIT =>
            accept WAIT_TILL_INIT;
      or
         when INIT =>
            accept GET_HOTLIST(S: out COPY_SET) do
               S:=HOTLIST;
            end GET_HOTLIST;
      or
         accept GET_STATUS(NOD: SITE_ID; STATE: out SITE_STATE;
                           T: out TIME) do
            RETRIEVE(NOD,STATE,T);
         end GET_STATUS;
      or
         accept UPDATE(T: TRIAD);
         MODIFY(T);
         if (T.STATE=HOT) or (T.STATE=PRIMARY) then
            HOTLIST(T.NO):=TRUE;
         end if;
         if not INIT then
            SITES_NOT_INITIALIZED(T.NO):=FALSE;
            if SITES_NOT_INITIALIZED=(1..N=>FALSE) then
               INIT:=TRUE;
            end if;
         end if;
         GET_LOCAL_POS(LOCAL_POS);
         if (LOCAL_POS<=P) then ENTER_HOT:=TRUE; end if;
         if (LOCAL_POS=1) then
            BECOME_PRIMARY:=TRUE;
         end if;
         if (HP.T.STATE=PRIMARY) then
            PRIMARY_KNOWN:=TRUE;
         else
            PRIMARY_KNOWN:=FALSE;
         end if;
         if INIT and ((LOCAL_POS=1) and ((HP.T.STATE/=HOT)
               and (HP.T.STATE/=PRIMARY))) then
            --signal and wait for manual intervention to reinitialize
            --the file copy bearing sites.
            null;
         end if; --the site at the head of STATE_Q is unable to enter
                 --HOT state, therefore no further update
                 --transactions can be processed.
      end select;
   end loop;
end COPY_STATUS_KEEPER;
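
The comments in COPY_STATUS_KEEPER describe STATE_Q as holding the non-DEAD
sites in increasing order of the time component of their triads, followed
by the DEAD sites, and the task uses the local site's position in this
queue to decide when to enter HOT state (position at most P) and when to
become PRIMARY (position 1). A small Python sketch of that ordering rule
follows. The tuple layout and the function names are assumptions made for
illustration; the bodies of MODIFY and GET_LOCAL_POS are not given in the
listing.

```python
from typing import List, Tuple

Triad = Tuple[int, str, int]  # (site number NO, STATE, time TR)


def state_q(triads: List[Triad]) -> List[Triad]:
    """Non-DEAD sites in increasing order of TR, DEAD sites after them."""
    live = sorted((t for t in triads if t[1] != "DEAD"), key=lambda t: t[2])
    dead = [t for t in triads if t[1] == "DEAD"]
    return live + dead


def local_position(queue: List[Triad], local_site: int) -> int:
    """1-based position of the local site, measured from the head."""
    return 1 + [t[0] for t in queue].index(local_site)


def may_enter_hot(pos: int, p: int) -> bool:
    """A site may enter HOT state once it is among the first p entries."""
    return pos <= p


def should_become_primary(pos: int) -> bool:
    """The site at the head of the queue becomes PRIMARY."""
    return pos == 1
```

For four sites with P = 2, a site in third position has to wait, while
the site at the head is the PRIMARY candidate.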


task body UPDATE_HANDLER is
   LNI: constant SITE_ID:=UPDATE_HANDLER'INDEX;
   HOT: BOOLEAN:=TRUE; --flag to distinguish whether the state
                       --is HOT or PRIMARY.
   CVN: VERSION_NUMBER; --current version number.
   TID: constant TASK_ID:=COMMON.GET_MY_ID;
   TR_ID: TRANSACTION_ID;
   COUNT: INTEGER range 0..N;
   READY: BOOLEAN;
   UP: UPDATE;
begin
   READER_WRITER(LNI).GET_VERSION_NUMBER(CVN);
   <<MAIN>>
   loop
      select
         accept UPDATE(U_ID: TRANSACTION_ID; U: UPDATE;
                       AM_PRIMARY: out BOOLEAN; RES: out (ACCEPT,REJECT);
                       HL: out COPY_SET) do
            --protection and integrity checks are assumed
            --to have been done.
            TR_ID:=U_ID;
            UP:=U;
            READER_WRITER(LNI).GET_WRITELOCK(TID);
            if HOT then --in HOT state
               AM_PRIMARY:=FALSE;
               RES:=ACCEPT;
               READY:=TRUE;
            else --in PRIMARY state
               AM_PRIMARY:=TRUE;
               COPY_STATUS_KEEPER(LNI).GET_HOTLIST(HL);
               COUNT:=0;
               for I in SITE_ID loop
                  if HL(I)=TRUE then
                     COUNT:=COUNT+1;
                  end if;
               end loop;
               if COUNT /= P then --accept update only if P HOT copies
                                  --(including the PRIMARY) exist.
                  READY:=FALSE;
                  RES:=REJECT;
                  READER_WRITER(LNI).RELEASE_WRITELOCK(TID);
               else
                  READY:=TRUE;
                  RES:=ACCEPT;
               end if;
            end if;
         end UPDATE;
         <<COMMIT_OR_ABORT>>
         if READY then
            loop
               accept DECISION(U_ID: TRANSACTION_ID; COMMIT: BOOLEAN) do
                  if TR_ID=U_ID then
                     if COMMIT then
                        READER_WRITER(LNI).WRITE(UP,CVN+1,CVN);
                        UPDATES_TO_BE_BROADCAST_MAINTAINER(LNI).
                           ADD_UPD(UP,CVN);
                     end if;
                     READER_WRITER(LNI).RELEASE_WRITELOCK(TID);
                     exit COMMIT_OR_ABORT;
                     --exit this loop only if the fate of the
                     --transaction has been decided.
                  end if;
               end DECISION;
            end loop;
         end if;
      or
         accept BECOMING_PRIMARY;
         HOT:=FALSE;
      end select;
   end loop;
end UPDATE_HANDLER;


task body LATEST_VERSION_NO_REQ_HANDLER is
   LNI: constant SITE_ID:=LATEST_VERSION_NO_REQ_HANDLER'INDEX;
   NOD: SITE_ID;
   ST: (SUCCESS,FAILURE);
   VN: VERSION_NUMBER;
   TID: TASK_ID:=GET_MY_ID;
   T: constant TIME:=....; --should be set to a value sufficient for
                           --the site requesting the latest version
                           --number to get all updates up to this
                           --version number, inform all sites of its
                           --entry into HOT state and then perform
                           --a handshake with this task.
begin
   loop
      loop
         select
            accept HOT_VERSION_NO_REQ(NO: SITE_ID; V: out VERSION_NUMBER)
            do
               NOD:=NO;
               READER_WRITER(LNI).GET_READLOCK(TID);
               --block updates while the process of getting the
               --caller site into HOT state is going on.
               READER_WRITER(LNI).GET_VERSION_NUMBER(VN);
               V:=VN;
            end HOT_VERSION_NO_REQ;
            UPDATES_TO_BE_BROADCAST_MAINTAINER(LNI).
               BROADCAST_ON_REACHING(VN);
            --initiate a quick update list broadcast so that
            --the caller site does not have to wait till the
            --next of the periodic broadcasts.
            exit; --go to wait for handshake;
         or
            accept HANDSHAKE(NO: SITE_ID; V: VERSION_NUMBER; ST: out
                             (SUCCESS,FAILURE)) do
               ST:=FAILURE; --a HANDSHAKE entry accepted here indicates
                            --that the call did not come in time.
            end HANDSHAKE;
         end select;
      end loop;
      loop
         select
            accept HANDSHAKE(NO: SITE_ID; V: VERSION_NUMBER; ST: out
                             (SUCCESS,FAILURE)) do
               if (NO=NOD) and (V=VN) then
                  ST:=SUCCESS; --successful broadcast
                  READER_WRITER(LNI).RELEASE_WRITELOCK(TID);
                  exit; --go to wait for the next request for the latest
                        --version number.
               else
                  ST:=FAILURE; --this call did not come within the set
                               --period or is a duplicate of a call that
                               --either did not come in time or which
                               --resulted in a successful handshake.
               end if;
            end HANDSHAKE;
         or
            delay T; --time period for the site that requested the
                     --latest version number to call HANDSHAKE.
            READER_WRITER(LNI).RELEASE_WRITELOCK(TID);
            exit;
            --expected handshake did not occur, so go back to wait
            --for the next request for the latest version number.
         end select;
      end loop;
   end loop;
end LATEST_VERSION_NO_REQ_HANDLER;


task body UPDATE_LIST_BROADCASTER is
   LNI: constant SITE_ID:=UPDATE_LIST_BROADCASTER'INDEX;
begin
   loop
      accept PICKUP_PACKAGE(UP: UPDATE_PACKAGE);
      for NOD in SITE_ID loop
         if LNI /= NOD then
            begin
               UPDATE_RECEIVER(NOD).UPDATE_LIST(UP);
            exception
               when TASKING_ERROR => null;
            end;
         end if;
      end loop;
   end loop;
end UPDATE_LIST_BROADCASTER;


task body BROADCAST_TIMER is
   LNI: constant SITE_ID:=BROADCAST_TIMER'INDEX;
   TBROD: constant TIME:=....; --period of update lists broadcasts.
begin
   loop
      delay TBROD;
      UPDATES_TO_BE_BROADCAST_MAINTAINER(LNI).PREPARE_PACKAGE;
   end loop;
end BROADCAST_TIMER;


task body UPDATES_TO_BE_BROADCAST_MAINTAINER is
   LNI: constant SITE_ID:=UPDATES_TO_BE_BROADCAST_MAINTAINER'INDEX;
   LUPV, V: VERSION_NUMBER; --LUPV is the version number up to which
                            --updates have already been broadcast.
   P, Q: access UPDATE_PACKAGE;
   LO, HI: VERSION_NUMBER;
   HOT: BOOLEAN:=FALSE;
   TRAP_SET: BOOLEAN:=FALSE; --used to initiate special broadcasts
                             --when a site is entering HOT state.
   TRAP_VN_NO: VERSION_NUMBER;
   NO_UPD_ON_HAND: BOOLEAN:=TRUE;
begin
   UPDATE_COLLECTOR(LNI).GET_FIRST_VERSION_NUMBER(V);
   LUPV:=V-1;
   loop
      select
         when not HOT =>
            accept REPLACE(UP: UPDATE_PACKAGE);
            if NO_UPD_ON_HAND then --if no updates are in this
                                   --site's possession which have
                                   --not been broadcast
               if (UP.HVN>LUPV) then --if this list contains any
                                     --updates not known to have
                                     --been broadcast
                  NO_UPD_ON_HAND:=FALSE;
                  P:=new UPDATE_PACKAGE(LVN=>LUPV+1, HVN=>UP.HVN);
                  for VN in P.LVN..P.HVN loop --store the list.
                     P.UPDATES(VN):=UP.UPDATES(VN);
                  end loop;
               end if;
            else --this site has some updates that it cannot be sure
                 --have been broadcast
               if (UP.HVN>P.HVN) then --if this list contains some
                                      --new updates
                  Q:=new UPDATE_PACKAGE(P.all);
                  if (P.LVN>UP.LVN) then
                     LO:=P.LVN;
                  else
                     LO:=UP.LVN;
                  end if; --updates with version number less than
                          --LO are known to have been broadcast.
                  HI:=UP.HVN;
                  LUPV:=LO-1;
                  P:=new UPDATE_PACKAGE(LVN=>LO, HVN=>HI);
                  for V in LO..HI loop
                     if V in Q.LVN..Q.HVN then
                        P.UPDATES(V):=Q.UPDATES(V);
                     else
                        P.UPDATES(V):=UP.UPDATES(V);
                     end if;
                  end loop; --merge the newly arrived list with the
                            --updates on hand.
               end if;
            end if;
      or
         when not HOT =>
            accept ENTERING_HOT;
            HOT:=TRUE; --from now on updates are directly added to
                       --the set on hand as they are committed,
                       --instead of being received in periodic
                       --broadcasts.
      or
         when HOT and not TRAP_SET =>
            accept BROADCAST_ON_REACHING(V: VERSION_NUMBER);
            if (NO_UPD_ON_HAND and LUPV < V) or
               ((not NO_UPD_ON_HAND) and P.HVN < V) then
               --do not set trap if the update corresponding to V
               --has already been broadcast or if it is in
               --the set of updates on hand.
               TRAP_SET:=TRUE;
               TRAP_VN_NO:=V;
            else --if the update has not been broadcast but is at
                 --hand then initiate a broadcast.
               if ((not NO_UPD_ON_HAND) and
                   P.LVN <= V and P.HVN >= V) then
                  UPDATE_LIST_BROADCASTER(LNI).PICKUP_PACKAGE(P.all);
                  NO_UPD_ON_HAND:=TRUE;
                  LUPV:=P.HVN;
               end if;
            end if;
      or
         when HOT =>
            accept ADD_UPD(U: UPDATE; V: VERSION_NUMBER);
            if (NO_UPD_ON_HAND and V=LUPV+1) then
               P:=new UPDATE_PACKAGE(LVN=>V, HVN=>V, UPDATES(V)=>U);
               NO_UPD_ON_HAND:=FALSE;
            elsif ((not NO_UPD_ON_HAND) and V=P.HVN+1) then
               Q:=new UPDATE_PACKAGE(P.all);
               P:=new UPDATE_PACKAGE(LVN=>Q.LVN, HVN=>V);
               for VN in Q.LVN..Q.HVN loop
                  P.UPDATES(VN):=Q.UPDATES(VN);
               end loop;
               P.UPDATES(V):=U;
            end if;
            if (TRAP_SET and then P.HVN >= TRAP_VN_NO) then
               --if the trap is set and the new update sets it off
               UPDATE_LIST_BROADCASTER(LNI).PICKUP_PACKAGE(P.all);
               --initiate a broadcast.
               NO_UPD_ON_HAND:=TRUE;
               LUPV:=P.HVN;
               TRAP_SET:=FALSE;
            end if;
      or
         when HOT =>
            accept PREPARE_PACKAGE; --periodic broadcast
            if (not NO_UPD_ON_HAND) then
               UPDATE_LIST_BROADCASTER(LNI).PICKUP_PACKAGE(P.all);
               NO_UPD_ON_HAND:=TRUE;
               LUPV:=P.HVN;
            end if;
      or
         when HOT =>
            accept DELETE_UPTO(V: VERSION_NUMBER);
            if (not NO_UPD_ON_HAND) then
               if (P.HVN < V) then
                  LUPV:=V;
                  NO_UPD_ON_HAND:=TRUE;
               elsif (P.LVN <= V) then
                  Q:=new UPDATE_PACKAGE(P.all);
                  P:=new UPDATE_PACKAGE(LVN=>V+1, HVN=>Q.HVN);
                  for VN in P.LVN..P.HVN loop
                     P.UPDATES(VN):=Q.UPDATES(VN);
                  end loop;
               end if;
            end if;
      end select;
   end loop;
end UPDATES_TO_BE_BROADCAST_MAINTAINER;


CHAPTER  3


ENSURING  THE  CORRECTNESS  OF  GLOBAL  INFORMATION 


3.1.  Introduction 

In  this  chapter,  we  address  the  problem  of  maintaining  the  availability  of 
global  information  in  a  computer  network,  in  the  presence  of  malfunctioning 
sites  in  the  network. 

Our model of the network is that it consists of a set of sites attached to
a communication subsystem. We assume that this subsystem provides perfect
site-to-site communication, so that all messages are delivered intact
within a known period. Note that in this model the communication subsystem
does not provide a reliable broadcast mechanism; in fact, the difficulty of
performing a reliable broadcast will be a major issue in the following
discussion. Further, it is assumed that no site A can masquerade as another
site B and send messages as originating from B. The ideas presented below
can be extended to the case where the communication is imperfect; the
assumption of perfection is made to simplify the presentation.

In Chapter 1, we distinguished between two kinds of failure models for
network sites. In the model in which crashes are the only mode of failure,
a site exhibits fail-stop behavior [SCH 83] and performs a recovery
procedure as its first act after each crash. In the other model,
malfunctions are the mode of failure. A malfunctioning site may go through
arbitrary state transformations and emit arbitrary messages. In the extreme
case, a malfunctioning site may exhibit malicious intelligence, attempting
to disrupt the functioning of the rest of the network. In Chapter 2, we
showed how to detect site crashes and maintain a view of network status for
the former model. The essence of malfunction as a model of failure,
however, is that the existence of a malfunctioning site may go undetected
for an indefinite period of time. Hence, it is necessary to develop
techniques that preserve the availability of global information in the
presence of arbitrary, undetected failures, and this is the aim of this
chapter.

In Section 3.2, we discuss why and under what conditions replication
should be used to deal with malfunctions. Section 3.3 explains the
phenomenon of error propagation, which can occur when malfunctioning sites
are present and which can progressively render all the global information
stored in the network incorrect. Section 3.4 outlines our approach to
preserving correctness. Section 3.5 deals with relevant past work in this
area and the reasons why it cannot be directly used to solve the problem
being considered. Section 3.6 describes the extensions to, and
modifications of, this work necessary to carry out our approach. Section
3.7 further develops our approach by describing protocols that prevent
error propagation when a particular form of bound on the number of
malfunctioning sites holds, and which have some nice properties.

3.2.  Redundancy  Techniques  for  Storing  Information 

In order to protect the availability of information against crashes and
malfunctions, some form of redundancy is required. One form of redundancy
is replication, in which multiple copies of the information are made and
stored with each copy at a different site. Another form of redundancy that
could be used is error-detecting and correcting codes. Consider an n-bit
piece of information. It could be encoded in $N = n + k$ bits, where
$k = \lceil \log_2 N \rceil$, using a distance-3 Hamming code and stored in
N sites, one bit in each site. Then the piece of information would remain
available as long as no more than one site crashes or malfunctions.
However, this solution will incur a large amount of communication overhead,
since a large number of sites may have to be consulted to retrieve the
information. Also, since the information is partitioned among many sites,
it is not possible to process it locally at any of these sites. Rather, the
information must first be assembled at some point before processing,
further increasing the communication overhead. Since communication
bandwidth is, and is expected to remain [OUS 80], a bottleneck in most
distributed systems, we do not consider this approach further. Thus,
although error-detecting and correcting codes can be used locally at each
site, e.g. in the memory and ALU, to lessen the likelihood of its crashing
or malfunctioning, and also in the communication subsystem, the appropriate
redundancy technique for stored information at the system level, where the
unit of failure is a site, is to replicate the information at multiple
sites.
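
To make the overhead argument concrete, the following Python sketch counts
how many sites would hold one bit each under a single-error-correcting
(distance-3) Hamming code. It relies on the standard condition that the
number of check bits k is the smallest integer with 2^k >= n + k + 1; the
function names are illustrative and not part of the original text.

```python
def hamming_parity_bits(n: int) -> int:
    """Smallest k with 2**k >= n + k + 1 (distance-3 Hamming code)."""
    k = 1
    while 2 ** k < n + k + 1:
        k += 1
    return k


def sites_needed(n: int) -> int:
    """Total code length N = n + k, i.e. the number of sites when each
    site stores exactly one bit of the codeword."""
    return n + hamming_parity_bits(n)
```

Under this scheme a 4-bit item is spread over 7 sites and an 11-bit item
over 15, and a reader may have to contact many of them, whereas with
replication a single failure-free copy already holds the whole item.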

In order to preserve the availability of an item of information in the
presence of m malfunctioning sites, it is necessary to replicate the
information at $2m+1$ sites. Then, by consulting each of the $2m+1$
replicas and taking a majority vote, the correct value is obtained as long
as there are at most m malfunctioning sites. The larger the number of
replicas, the greater the protection against the information getting lost
due to malfunctioning of sites. But this is true only if the probability p
of a site malfunctioning is less than 0.5. If $p > 0.5$, replication only
reduces the probability of obtaining the correct value. Assume the number
of replicas is increased from $2r-1$ to $2r+1$. There are two possible
situations in which this addition of two extra replicas makes a difference
in determining whether a majority vote yields the correct value or not:


(a) There are $r-1$ malfunctioning sites among the original $2r-1$ sites,
but both the additional sites are malfunctioning. Here a majority vote with
the original $2r-1$ replicas will yield the correct value, but a majority
vote after the addition of the two replicas will not. The probability of
this occurring is

\[ p_1 = p^2 \binom{2r-1}{r-1} p^{r-1} (1-p)^{r} \]

(b) There are r malfunctioning sites among the $2r-1$ original replicating
sites, but both the additional replicas are failure-free. Here, while the
original set of replicas might not yield the correct value in a majority
vote, the augmented set will. The probability of this occurring is

\[ p_2 = (1-p)^2 \binom{2r-1}{r} p^{r} (1-p)^{r-1}
       = \binom{2r-1}{r} p^{r} (1-p)^{r+1} \]

Therefore, the improvement in the probability of obtaining the correct value
is

\[ p_2 - p_1 = \binom{2r-1}{r} p^{r} (1-p)^{r} (1-2p) \]

which is greater than 0 iff $p < 0.5$. Hence replication is desirable only
if $p < 0.5$.
In  the  rest  of  this  chapter  we  assume  that  this  condition  holds. 
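
The closed form for the improvement can be checked numerically. The Python
sketch below assumes that sites malfunction independently, each with
probability p (the assumption implicit in the binomial terms above), and
compares the direct computation of the majority-vote success probability
for $2r-1$ and $2r+1$ replicas with the expression
$\binom{2r-1}{r} p^r (1-p)^r (1-2p)$. The function names are illustrative.

```python
from math import comb


def p_correct(n: int, p: float) -> float:
    """Probability that a majority of n replicas (n odd) is failure-free,
    i.e. that at most n // 2 of them malfunction."""
    m = n // 2
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(m + 1))


def improvement(r: int, p: float) -> float:
    """Closed-form gain from going from 2r-1 to 2r+1 replicas."""
    return comb(2 * r - 1, r) * p ** r * (1 - p) ** r * (1 - 2 * p)
```

For any r, the gain is positive for p below 0.5, zero at p = 0.5 and
negative above it, which is the condition assumed for the rest of the
chapter.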

3.3.  Effect  of  Malfunctions  on  Correctness 

Let $t: y \leftarrow f(z)$ be a transaction entered into the network which
seeks to update y to a new value which is a function of the current value
of z. We call z the read-variable and y the write-variable. Assume we have
one copy of z at site X and three copies of y at Y1, Y2 and Y3 (Fig. 3.1).
Assume that site X is malfunctioning. Then the values of z or f(z)
(depending on where f(z) is computed) sent to Y1, Y2 and Y3 may be
incorrect and may be different. If no precautions are taken, the copies of
y will take on incorrect and divergent values. For Y1, Y2 and Y3 to reach
agreement is non-trivial, since there may be malfunctioning sites among
them too. If other portions of the global information are thereafter
updated directly or indirectly as a function of y, the incorrectness of the
latter gets propagated. This kind of error propagation, if unchecked, will
increasingly disrupt the functioning of the network. To check it, in some
cases it will be sufficient if Y1, Y2 and Y3 take on some common value for
y, but in others additional restrictions on this common value will have to
be enforced.
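
The divergence described above can be illustrated with a toy Python
simulation. Everything here is an assumption made for illustration: the
update function f, the particular way the faulty site X distorts z, and the
site names as dictionary keys. A real malfunctioning site could send
arbitrary values.

```python
def f(z: int) -> int:
    """Example update function; any deterministic function would do."""
    return 2 * z + 1


def faulty_transmitter(true_z: int, receivers: list) -> dict:
    """A malfunctioning X may send a different value of z to each receiver.

    Here the distortion is simply a different offset per receiver.
    """
    return {name: true_z + offset
            for offset, name in enumerate(receivers, start=1)}


def apply_update(received: dict) -> dict:
    """Each receiver computes y = f(z) from whatever value it was sent."""
    return {name: f(z) for name, z in received.items()}
```

With a true value z = 10, the three copies of y end up as 23, 25 and 27:
none equals f(10) = 21 and no two agree, so a later majority vote over the
copies of y cannot recover a correct value.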

As an example to illustrate the importance of maintaining the correctness
of global information, consider a dynamic packet radio network in which a
group of sites wishes to perform some task, composed of a set of subtasks.
Assume that the group has first to determine how many sites are present in
the group and how they are connected and then, based on this topology
information, to assign subtasks to sites. Assume that for reliability these
steps are to be done in a distributed manner and that the following method
is chosen. Each site communicates with the rest of the group and determines
the topology. Then each of the sites applies a common algorithm to compute
the assignment of subtasks to sites. We then require that (a) the correctly
operating sites arrive at a common view of the topology, so that the
assignment of subtasks, though done in a distributed manner, is consistent,
and (b) this common view at least "closely" represents the true topology;
otherwise the assignment may prove ineffective. (Consider what may happen
if a large number of non-existent "ghost" sites are imported into the view
by malfunctioning sites. Then critical subtasks may be assigned to ghost
sites.)




3.4.  Outline  of  Proposed  Approach  for  Maintaining  Correctness 

As  explained  in  Section  3.2,  we  can  replicate  global  information  and 
store  the  copies  at  different  sites  in  order  to  guard  its  correctness  against 
malfunctioning  sites. 

Suppose a piece of information is replicated at $2m+1$ sites,
$m = 0, 1, 2, \ldots$ As long as no more than m of these sites malfunction,
any site can, by consulting all $2m+1$ sites and taking the majority value,
obtain the correct value.
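
A minimal Python sketch of the majority rule follows. It assumes the
replies from all $2m+1$ sites have already been collected into a list; the
function name is illustrative. Returning None when no value has a strict
majority corresponds to the case where more than m sites have malfunctioned
or the failure-free copies disagree.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_value(replies: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value reported by a strict majority of replicas, else None."""
    if not replies:
        return None
    value, count = Counter(replies).most_common(1)[0]
    return value if 2 * count > len(replies) else None
```

With m = 2 (five replicas), two arbitrary wrong replies are outvoted by
three agreeing correct ones, but three wrong replies leave no majority.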

In stating that the correct value can be obtained by the above procedure,
we are assuming that the following conditions hold: (a) the failure-free
sites have the same stored value and (b) this value is correct. However,
even if a majority of the $2m+1$ sites are failure-free, conditions (a) or
(b) or both could be violated if precautions are not taken when updating
the information. This is because of the phenomenon of error propagation
explained in Section 3.3.

The problem of preventing error propagation can be stated as follows.
Assume that the update $y \leftarrow f(z)$ has been submitted to the
system. There are $2t+1$ copies of z, each at a different site. This set of
sites is called the transmitter set $\{T\}$. Similarly, there are $2r+1$
copies of y stored in the receiver set $\{R\}$. (The sites holding copies
of a write-variable in a transaction are called receivers, and the sites
holding copies of a read-variable of the transaction are called
transmitters. Note that the same site may be a transmitter in one
transaction and a receiver in another.) We will assume in this chapter that
$\{T\} \cap \{R\} = \mathrm{nil}$, i.e. the sets are disjoint.

In order to prevent error propagation as a result of processing the
update, two steps must be taken:

(i) the failure-free sites in $\{R\}$ must reach agreement on the value of
z. We call this the unanimity-reaching step.

(ii) the value of z agreed on must be verified to be correct. The extent to
which this can be done depends on the knowledge that the sites in $\{R\}$
have regarding what values of z are reasonable. This knowledge is stored in
the form of assertions. We call this step of verifying that z satisfies
these assertions the acceptability-checking step. The limitation here is
that in some cases, it may be difficult to develop assertions that can, to
a useful extent, restrict wrong values from passing. In cases where such
assertions cannot be generated but it is crucial to protect the updated
information, the only solution appears to be to increase the degree of
replication of the read-variable and thus diminish the probability of
obtaining a wrong value.

We use the example given above of the group of sites that wish to
determine their topology and assign subtasks to different sites in the group.
Here the receivers are the sites in the group, y is the topology
information, x is the position of any given site and {T} is the set of sites
reporting the position.

In the unanimity-reaching step, the sites in {R} reach agreement on the
position of a site. The acceptability-checking step can be used to try to
screen out "ghost" sites. For instance, the assertion we may require to be
satisfied may be that all the sites in some appropriately defined "vicinity" of
an alleged site confirm the existence of that site. How effective the assertion
is depends on the presence of failure-free sites in the vicinity.

In general, the unanimity-reaching and acceptability-checking steps
may be intertwined or follow one another in either order depending on the
problem at hand. For example, if the size of x's representation is large, and
if the value received is found to fail the acceptability-checking step, a special
symbol denoting an unacceptable value may be used in the
unanimity-reaching step.

The  nature  of  the  acceptability-checking  step  is  very  much  dependent 
on  the  problem  at  hand.  Hence  we  will  not  discuss  it  further.  We  will  discuss 
the  unanimity-reaching  step  in  detail,  but  first  we  give  a  brief  description  of 
the  results  available  from  the  literature  that  are  relevant. 

3.5.  The  Byzantine  Generals  Agreement  Problem 

A number of papers have appeared on the so-called Byzantine Generals
Agreement (BGA) problem [PEA 80, LAM 80, DOL 81, DOL 82a, DOL 82b,
DOL 82c].

Consider a site T which wishes to transmit a value V to a set of receiving
sites {R}. Then the Byzantine Generals Agreement is reached among the
sites in {R} if the following conditions are fulfilled:

1. If the transmitter is failure-free, all failure-free receivers agree on V as the
common value.

2. All failure-free receivers arrive at the same value, whether the transmitter
is malfunctioning or failure-free.

To show the nature of this agreement, we show an example of a network
of four sites in Fig. 3.2. Assume that the transmitting site (marked T in the
figure) is required to transmit the value 5 to the receiving sites (marked R in
the figure). We show two possible situations, in the left and right parts of the
figure respectively, the first involving a malfunction in one of the receivers
and the second in the transmitter itself. We show how an algorithm given in
[LAM 80] is applied to this network to reach the BGA, assuming, as is true for
the two situations described above, that there is at most one malfunctioning
site in the network.

FIG. 3.2. REACHING GBGA IN THE PRESENCE OF ONE MALFUNCTIONING SITE

The algorithm requires two phases of communication in
our example, under the assumption made above. In the first phase, the
transmitter sends its value, i.e. 5, to all the receivers. Note that in the second
situation the transmitter is malfunctioning and does this incorrectly. In the
second phase, each receiver sends the value received in the first phase to all
receivers including itself. Note that in the first situation, one of the
receivers is malfunctioning and executes this phase incorrectly. After this
phase, each receiver computes the median of the values received in the
second phase. A quick look at Fig. 3.2 will verify that all correctly operating
receivers arrive at the same value, 5 in the first situation, 2 in the second. In
the first situation, the transmitter is failure-free and each failure-free
receiver has received a majority of values corresponding to the transmitter
value. The median computed is thus the transmitter value. In the second
situation, the transmitter is malfunctioning. Although the receivers compute
a median value which is different from the correct value (5), they all
compute the same median value (2). Thus the conditions of BGA are fulfilled in
both situations.
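The two-phase exchange described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the receiver names R1 to R3, the helper `two_phase_agreement`, and the particular wrong values sent by the malfunctioning sites are hypothetical and are not taken from Fig. 3.2; only the transmitter value 5 and the resulting medians 5 and 2 follow the text.

```python
from statistics import median_low

def two_phase_agreement(phase1, relay):
    """phase1[r] is the value the transmitter sent to receiver r in phase 1.
    relay(sender, target, value) is what `sender` forwards to `target` in
    phase 2; a failure-free receiver forwards `value` unchanged."""
    receivers = sorted(phase1)
    decided = {}
    for target in receivers:
        # Each receiver, the target included, reports its phase-1 value.
        reports = [relay(s, target, phase1[s]) for s in receivers]
        decided[target] = median_low(reports)
    return decided

def honest(sender, target, value):
    return value

def r3_malfunctions(sender, target, value):
    # Hypothetical misbehaviour: R3 tells R1 "9" and everyone else "0".
    if sender == "R3":
        return 9 if target == "R1" else 0
    return value

# Situation 1: failure-free transmitter sends 5, receiver R3 malfunctions.
situation1 = two_phase_agreement({"R1": 5, "R2": 5, "R3": 5}, r3_malfunctions)

# Situation 2: malfunctioning transmitter sends inconsistent values
# (hypothetical values chosen so that the common median is 2).
situation2 = two_phase_agreement({"R1": 1, "R2": 2, "R3": 5}, honest)
```

In the first call the failure-free receivers R1 and R2 both settle on the transmitter value; in the second all receivers settle on the same median even though it differs from the intended value.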

The BGA problem has two variants corresponding to whether an
authentication facility is available or not. (The example shown above did not use
authentication.) Authentication permits a site to seal its messages so that
another site receiving them can assure itself that their contents have not
been altered even though the messages were handled on the way by other
sites before reaching it. Thus although a malfunctioning site can abstain
from relaying a message which, had it been failure-free, it would have relayed,
it cannot tamper with its contents and then relay the message without being
detected. The authentication facility can be implemented using public-key
encryption [DIF 76, RIV 78].

Consider a network of N sites with at most M malfunctioning sites. It has
been shown [PEA 80] that if authentication is not available, it is necessary
that N > 3M, and that whether or not authentication is available, at least M+1
phases of communication are required [DOL 82c].

Table  3.1  shows  some  of  the  features  of  the  published  algorithms  to  solve 
the  BGA  problem.  It  can  be  seen  that  algorithms  of  polynomial  complexity 
are  available  for  both  variants  of  the  BGA  problem. 

The  required  number  of  messages  mentioned  in  the  table  reflects  the 
worst  case.  Considering  the  algorithms  which  do  not  employ  authentication, 
the  algorithm  indicated  in  the  last  row  has  polynomial  worst-case  communi¬ 
cation  cost  whereas  those  in  the  second  row  (the  algorithms  indicated  in  this 
row  are  variants  of  the  same  basic  approach)  have  exponential  worst-case 
communication  costs.  But  the  former  algorithm  requires  more  than  twice 
the  number  of  phases  and  is  vastly  more  complex.  Further,  the  algorithms 
in the second row can be modified so that while they can still handle up to M
malfunctioning  sites,  they  require  only  about  N *  messages  when  there  are 
actually no malfunctions [LAM 81a], which will usually be the case. For these
reasons  they  may  be  preferred  to  the  algorithm  indicated  in  the  last  row. 

We have assumed so far that there is a direct, failure-free link between
every pair of sites. In [DOL 81], it is shown how these algorithms can be
extended to a point-to-point network, where the connectivity is not complete
and the links are not failure-free. Instead of having a bound on the number
of malfunctioning sites, a bound is imposed on the number of malfunctioning
sites plus failed links. Each message from one site to another is sent along a
sufficient number of disjoint routes, so that effectively perfect virtual
connections are provided between every pair of sites. In presenting our results
below, we will continue to assume perfect connectivity, but the techniques of
[DOL 81] can be used to relax this assumption.

In [LAM 81a] a scheme for using BGA solutions to implement distributed
systems that are able to tolerate malfunctioning sites is described. The
basic version of this scheme involves replication of all functions and
information at every site in the system. Transactions entering at any point in the
system are timestamped and broadcast using BGA techniques so that
malfunctioning sites are unable to prevent agreement among failure-free sites
as to the transactions received. Control messages exchanged among the
sites, e.g. for commit processing, also use this reliable broadcast mechanism.
Thus all failure-free sites see the same input stream of messages and
execute the same sequence of actions. As long as the number of malfunctioning
sites satisfies the bounds assumed by the BGA algorithm being used, the
system as a whole performs correctly. If the bounds are exceeded, the
information stored in the failure-free sites may diverge and from that point on, the
system may perform incorrectly until appropriate repair actions are
initiated from outside.

In many, if not most, networks, such complete replication would be
infeasible since it would require too much storage at each site. It is
suggested in [LAM 81a] that in such cases, only critical functions be completely
replicated and managed according to the above scheme, constituting a
synchronizing kernel for the system. Other functions and information would be
managed by a separate mechanism. The example of a distributed file system
is given as an illustration. Here the directory information and the open file
and close file operations would be in the synchronizing kernel but not the
files themselves or the read file and write file operations; these would be
handled by a separate mechanism.

This separate mechanism would be used to access information at a
remote site when it is unavailable locally. Here the danger of error
propagation discussed in Section 3.3 arises, since the remote site may be
malfunctioning. Our solution to this problem, as discussed in Section 3.4,
consists of a unanimity-reaching step and an acceptability-checking step.
For the unanimity-reaching step, the BGA problem is relevant. However, we
may have a multiplicity of transmitters (since the information is replicated,
though partially) whereas the BGA has been stated for the case of one
transmitter. Also it may be too expensive to use BGA techniques on each
remote access to global information. We discuss these issues in Section 3.6.

3.6.  Details  of  Proposed  Approach 

3.6.1. The Generalized Byzantine Generals Agreement Problem

The  requirements  of  the  unanimity-reaching  step  of  Section  3.4  may  be 
stated  as  follows. 

Given a set of transmitters {T} each of which has a copy of a given piece
of information (the read-variable), and a set of receivers {R} (which hold
copies of the write-variable) which wish to access this information, the
Generalized Byzantine Generals Agreement (GBGA) is reached by the receivers if

a) All failure-free receivers agree on the same value.

b) If a majority of transmitters are failure-free, and each of the transmitters
in this majority has the same value V for the information, then the receivers
agree on V as the common value.
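The two clauses can be read as a predicate over the outcome of a run. The sketch below is only an illustration: the function name `gbga_reached`, its arguments, and the site labels are hypothetical, and clause b) is applied to the failure-free receivers only.

```python
from collections import Counter

def gbga_reached(decisions, faulty_receivers, transmitter_values, faulty_transmitters):
    """decisions[r]: value receiver r settled on.
    transmitter_values[t]: value held by transmitter t."""
    good = [v for r, v in decisions.items() if r not in faulty_receivers]
    # Clause a): all failure-free receivers hold one common value.
    if len(set(good)) > 1:
        return False
    # Clause b): applies only if failure-free transmitters holding one
    # common value V form a majority of ALL transmitters.
    healthy = [v for t, v in transmitter_values.items()
               if t not in faulty_transmitters]
    if healthy:
        value, count = Counter(healthy).most_common(1)[0]
        if count > len(transmitter_values) / 2:
            return all(v == value for v in good)
    return True
```

For example, with transmitters holding 7, 7 and 3 (the last one malfunctioning), the receivers satisfy GBGA only if every failure-free receiver ends with 7.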


Clause a) is the same as for the BGA. The changes for clause b) are
simple to understand. If a majority of transmitters are not failure-free, i.e. if a
majority of malfunctioning sites exists, the latter by acting in collusion can
make it impossible to deduce that the remaining minority of transmitters
are the ones from which the correct value is to be obtained. The reason for
requiring that the failure-free majority of transmitters should have the
same value is that there is a possibility that they may have divergent values
because of prior error propagation, in which case clause a) already specifies
the best the receivers can hope to do.

3.6.2.  Malfunction-Tolerance  Specification 

If we wish to use GBGA to process an update transaction y ← f(x), then we
must specify the bounds on the number of malfunctioning sites, called the
malfunction-tolerance specification (MTS). The protocols used in processing
the update will then be such that as long as the actual number of
malfunctioning sites is within the MTS, GBGA will be reached by the receivers.

Assume that there are (2t+1) sites, forming the set of transmitters {T},
which hold copies of x, and (2r+1) sites forming the set of receivers {R}
holding copies of y. F(S) denotes the number of malfunctioning sites in the set
S. Examples of possible MTSs are:

(a) F({T}) = 0

(b) F({T}) ≤ 1

(c) F({T}) ≤ t

(d) F({R}) ≤ r

(e) F({R} ∪ {T}) ≤ r, etc.

These MTSs specify different degrees of malfunction-tolerance and the
protocols that achieve GBGA given these MTSs have different communication
and computation costs associated with them. For example, for MTS (a), any
single transmitter can be accessed to get the value of x; for MTS (b), three
transmitters can be accessed (assuming |T| ≥ 3) and the majority value
taken, and so on. Thus a tradeoff exists between these costs and the degree
of malfunction-tolerance obtained.

3.6.3.  Scheme  Specification 

The global information consists of a set of data items. Suppose we are
given the update interactions between them in the form of a set of ordered
pairs (x_i, x_j), where the existence of a pair (x_i, x_j) implies the existence of an
update interaction in which one item of the pair is updated with the other as
its read-variable. [We assume update interactions of this form
for simplicity.] Suppose we are also given the degree of replication for each
item x_j.

The scheme specification (SS) specifies a protocol for each update
interaction pair. For example, SS could specify that for a given pair (x_i, x_j)
the  protocol  for  this  interaction  should  achieve  GBGA  with  the  MTS 
(  Here  the  degree  of  replication  of  x*  is  2r  +  l.) 

Given  the  data  items,  the  update  interactions,  the  degrees  of  replication 
and  the  scheme  specification,  the  behavior  of  the  system  information  as  to 
its  correctness  characteristics  in  the  presence  of  malfunctioning  sites 
(including  the  error  propagation  effects)  can  be  deduced.  Hence  in  order  to 
obtain  the  desired  behavior,  the  degrees  of  replication  and  the  scheme 
specification should be chosen appropriately. Below we give two examples of
possible scheme specifications.

3.6.3.1. Scheme Specification A

In Section 3.4, we commented that the correct value of a piece of
information y could be obtained as long as a majority of the sites holding copies
of it were failure-free, provided they were not contaminated by error
propagation. This contamination occurs when the information is being used as a
write-variable, i.e. when the sites holding copies of it are acting as receivers
in a transaction y ← f(x). If the failure-free sites among the sites holding
copies of y are a minority, the correct value of y would not be available and
there would be no point in trying to ensure that these failure-free sites arrive
at the same value of the read-variable x (or f(x) if the transmitters do the
computation of f(x)). Hence it is reasonable to use the following scheme
specification:

SS A: GBGA must be reached in {R} whenever F({R}) ≤ r.

SS A achieves the extreme of absolutely no error propagation in the following
sense. Consider a chain of update interaction pairs in which each item is
updated with the next item in the chain as read-variable. Suppose the
malfunctioning sites among those sites which hold copies of some item in the
chain constitute a majority. As a result of the particular scheme specification
used here, the item updated from it will not get contaminated by that update
interaction. Hence, as long as the sites holding copies of any item
have a failure-free majority, that item will be correct, since it cannot be
contaminated by error propagation.

SS A can be implemented by the following algorithm:

Algorithm:

(1) Each member of {R} samples each member of {T} and computes the
majority of the |T| values received.

(2) Each member of {R} broadcasts the value obtained in step (1) to all the
sites in {R} using a BGA algorithm with a bound r on the number of
malfunctioning sites in {R}.

(3) Each member of {R} computes the median of the |R| values received in
step (2).

Note  that  the  median  of  a  set  of  values  coincides  with  the  majority  value,  if 
one  exists,  and  is  unique  for  a  given  set  of  values.  Thus  if  each  receiver 
receives  the  same  set  of  values  (  which  may  not  have  a  majority  value  ),  or  if 
all  receivers  receive  sets  of  values  having  a  common  majority  value,  then 
computing  the  median  as  the  final  value  ensures  unanimity. 
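The three steps can be sketched as follows. This is a minimal illustration, not the report's implementation: `ss_a_update` and `stand_in_bga` are hypothetical names, the BGA broadcast of step (2) is replaced by a stand-in that simply returns the one value on which all failure-free receivers are assumed to agree for each sender, and step (1) uses the median, relying on the observation above that the median coincides with the majority value when one exists.

```python
from statistics import median_low

def ss_a_update(samples, bga_broadcast):
    """samples[r]: the |T| values receiver r sampled from the transmitters.
    bga_broadcast(sender, value): the value that all failure-free receivers
    agree `sender` broadcast (stand-in for a real BGA algorithm)."""
    receivers = sorted(samples)
    # Step (1): each receiver reduces its samples to a single value.
    step1 = {r: median_low(samples[r]) for r in receivers}
    # Step (2): each receiver broadcasts that value with BGA.
    agreed = [bga_broadcast(r, step1[r]) for r in receivers]
    # Step (3): the median of the |R| agreed values is the final value.
    return median_low(agreed)

def stand_in_bga(sender, value):
    # Hypothetical run: receiver R3 malfunctions and broadcasts 99.
    return 99 if sender == "R3" else value

# Three transmitters, one malfunctioning and sending 0, 42 and 7.
samples = {"R1": [7, 7, 0], "R2": [7, 7, 42], "R3": [7, 7, 7]}
new_value = ss_a_update(samples, stand_in_bga)
```

Because every failure-free receiver feeds the same list of agreed values into step (3), they all compute the same final value.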

As mentioned in Section 3.5, BGA has two variants. When authentication
is not available, step (2) can be executed only if |R| is at least 3r+1. Only
2r+1 copies of a piece of information are required to preserve its availability
in the presence of up to r malfunctioning sites. Hence there must be r extra
sites among the receivers, which need not have physical copies of the
information, but which take part in the algorithm described above (in Fig. 3.3(a),
they are shown as having "phantom" copies). At the end, the sites with
physical copies update them to the median value computed in step (3). "{R}" in the
scheme specification must be taken to mean these 3r+1 sites, out of which
only 2r+1 actually have copies of the information and take part in
transmitting it when it is used as read-variable in some update transaction. As shown
in Fig. 3.3(b), no phantom copies are required when authentication is
available.

For both variants, r+2 phases of communication are required. Hence
this scheme specification would be prohibitively expensive in the amount of
communication required and the time taken to process an update. Thus, it is
feasible to use it only if updates are very rare and the need to protect the
information is critical.

Below we present another scheme to remedy the drawbacks of SS A.

3.6.3.2. Scheme Specification B

SS B: as in SS A, except that when |T| ≥ |R|, majority voting on the
values sent out by the transmitters is used by every member of {R} to get
the value of the read-variable.

Conceptually, the global information can be divided into domains
depending on the degree of replication, e.g. a 1-copy domain, a 3-copy
domain, a 5-copy domain, etc. (Fig. 3.4). SS B has the property that
contamination can spread within a given domain and into lower-order domains but
not into higher-order domains. By properly allocating the global information
to the domains, the number of updates which require GBGA protocols can be
reduced, and thus the number of updates incurring the high communication
costs and processing time typical of SS A. A critical piece of information
should be placed in a higher-order domain so that it is less likely to become
unavailable as a result of a majority of the sites that hold copies of it
malfunctioning. By the same argument, a less critical piece of information
should be placed in a lower-order domain. Thus it is more likely to become
unavailable. However, it is prevented from contaminating the more critical
information by the protocol which governs such interactions.

For example, in a banking application, all information relating to
accounts larger than or equal to a dollars could be placed in the 3-copy
domain, and information relating to accounts smaller than a dollars could be
placed in the 1-copy domain. Most transactions would be limited to a single
domain. A funds transfer from an account in the 1-copy domain to another
in the 3-copy domain would cause the GBGA protocol to be invoked in
updating the latter.

3.7. Intermediate Cost Protocols

3.7.1.  Motivation 

Consider  the  following  MTS: 

M1: F({T}) ≤ t with all failure-free transmitters having the same value.

Reaching GBGA under M1 can be done simply and inexpensively by
having each receiver take the median of the values of all the transmitters. But
consider the update chain ... x_h ← x_i ← x_j ← x_k ← x_l. If, for any of the
variables, the number of malfunctioning replicas is a majority, error can propagate
backwards along the chain. This MTS is used in SIFT [WEN 78].

Consider  next  the  following  MTS: 

M2: F({R}) ≤ r.

This is the MTS used in SS A. Although reaching GBGA under M2 is
expensive, as mentioned in Section 3.6.3.1, there is no error propagation.

Now  consider  the  MTS: 

M3: F({T} ∪ {R}) ≤ r, with all failure-free transmitters having the same value.

It will be shown below that GBGA can be reached for this MTS with an
algorithm whose costs are a function of the difference in the degree of
replication of the read- and write-variables, decreasing as the difference decreases.
This is an appealing property, for as we noted in Section 3.2, the degree of
replication is a measure of the probability of the information becoming
unavailable because a majority of the sites holding copies of it are
malfunctioning. One would like to be more careful (at the expense of higher incurred
costs) in interacting with information that is more likely to be incorrect.
This MTS facilitates this. However, it permits error propagation for
the following reason. Considering any update in the chain mentioned above,
if the bound specified on the number of malfunctioning transmitters and
receivers is exceeded, GBGA may not be reached among the copies of the
variable updated. Then, considering the preceding update in the chain, the
condition in the MTS specifying that all failure-free transmitters should have
the same value is not satisfied. Hence GBGA may not be reached for this
update, even though the bound on the number of malfunctioning
transmitters and receivers for this update is satisfied. In this way error can
propagate backwards along the chain. Therefore, at appropriate points on
the chain, an MTS which prevents error propagation, e.g. M2, should be used,
and between these points MTSs such as M1 and M3 could be used.

In Section 3.7.2, the minimum total number of sites required to reach
GBGA without authentication under MTS M3 is determined. In Sections 3.7.3
and 3.7.4, algorithms for reaching GBGA without and with
authentication under this MTS are developed.


3.7.2.  Minimum  Number  of  Sites  for  GBGA  under  MTS  M3 

Consider the following situation. A network has N sites of which a set
{T}, |T| = 2t+1 < N, has a value to transmit to the rest of the sites in the
network, which are the receivers. It is assumed that all failure-free transmitters
have the same value. The number of malfunctioning sites in the network is ≤ m.
The problem is to find the minimum value of N, N_min, such that GBGA can be
reached among the receivers. If t ≥ m or there is only one receiver, then a
simple majority vote solves the problem. Hence, from now on we only
concern ourselves with the case where t < m and there is more than one
receiver.


It is clear that N_min ≤ 3m+1, for if N ≥ 3m+1, each transmitter can
send its value to all the sites in the network using a BGA algorithm
parameterized for a bound of m malfunctioning sites. Then each receiver
can take the median of the 2t+1 values received to be the value of the
transmitters and GBGA will be reached. [Note however that the costs of
reaching GBGA with this procedure are not a function of the difference in the
number of transmitters and the number of receivers. Algorithms which do
have this property will be presented in Sections 3.7.3 and 3.7.4.]

We show that the above bound is tight, i.e. N_min = 3m+1. For this
purpose, we use the concept of scenarios introduced in [PEA 80]. Our proof is a
non-trivial extension of the proof given in [PEA 80] to establish that at least
3m+1 sites are required to reach BGA in a network with at most m
malfunctioning sites, if authentication is not used.

Let {P} be the set of sites in the network, and define a scenario S as a
mapping from the set W of non-empty strings over {P} ending with a
transmitter, to V, the set of values. For a given r in {R}, define the
r-scenario S_r corresponding to a scenario S as the restriction of the mapping S
to strings in W beginning with r.

Let {X} be a set of sites in the network which are failure-free. A scenario
S is consistent with {X} if for each q ∈ {X}, p ∈ {P}, w ∈ W, S(pqw) = S(qw), i.e.
each site in {X} always relays values it receives correctly.
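The consistency condition can be checked mechanically on strings of bounded length. The sketch below is an illustration only: strings are modelled as Python tuples of site names, and the helper names `strings_in_w`, `is_consistent` and `make_scenario`, as well as the lie value -1, are hypothetical.

```python
from itertools import product

def strings_in_w(sites, transmitters, length):
    """All strings over `sites` of the given length that end with a transmitter."""
    for prefix in product(sites, repeat=length - 1):
        for t in transmitters:
            yield prefix + (t,)

def is_consistent(scenario, sites, transmitters, good, max_len):
    """Check S(pqw) == S(qw) for every q in `good`, p in `sites` and w in W,
    restricted to strings pqw of length at most `max_len`."""
    for w_len in range(1, max_len - 1):      # w is non-empty
        for w in strings_in_w(sites, transmitters, w_len):
            for q in good:
                for p in sites:
                    if scenario((p, q) + w) != scenario((q,) + w):
                        return False
    return True

def make_scenario(values, faulty):
    """Scenario in which failure-free sites relay faithfully and
    faulty sites relay the arbitrary value -1."""
    def S(s):
        if len(s) == 1:
            return values[s[0]]          # the transmitter's own value
        if s[1] in faulty:               # s[1] is the site relaying to s[0]
            return -1
        return S(s[1:])
    return S

sites = ["a", "b", "c"]
S = make_scenario({"a": 5}, faulty={"c"})
ok_for_ab = is_consistent(S, sites, ["a"], good={"a", "b"}, max_len=4)
ok_for_c = is_consistent(S, sites, ["a"], good={"c"}, max_len=4)
```

Here the scenario is consistent with {a, b}, which relay faithfully, but not with {c}.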

For each r in {R} = {P} − {T}, let F_r be a mapping that takes an
r-scenario S_r and returns a value in V, which is the value of the transmitters
finally arrived at by r. In order for {F_r | r ∈ {R}} to provide GBGA for each
scenario S consistent with some set of sites {X}, |X| ≥ N − m, we must have

(A1) if a majority of transmitters are failure-free and have a value v_t,
F_r(S_r) = v_t for all r ∈ {R} ∩ {X}.

(A2) for p, q ∈ {R} ∩ {X}, F_p(S_p) = F_q(S_q).

Suppose N ≤ 3m and, if possible, let {F_r | r ∈ {R}} provide GBGA. Divide {P}
into three sets A, B, C each having at most m sites, with A having a majority
of transmitters, B having the remaining transmitters, and both B and C
having at least one receiver (Fig. 3.5). This division is possible since t < m and
there are at least two receivers. Let v and v' be two distinct values in V.
Define the scenarios S1, S2, S3 consistent with A ∪ B, A ∪ C, B ∪ C as shown
below. In the following specifications, a_t and b_t represent any transmitter in
A and B respectively, a, b, c represent any sites in A, B, C respectively and w
represents any string in W.

S1:

S1(a a_t) = S1(b a_t) = S1(c a_t)
= S1(a b_t) = S1(b b_t) = S1(c b_t) = v.

S1(aaw) = S1(aw)    S1(abw) = S1(bw)

S1(baw) = S1(aw)    S1(bbw) = S1(bw)

S1(caw) = S1(aw)    S1(cbw) = S1(bw)

S1(ccw) = S1(cw)

S1(acw) = S1(cw)

S1(bcw) = S3(cw)

S2:

S2(a a_t) = S2(b a_t) = S2(c a_t) = v'

S2(a b_t) = S2(b b_t) = S2(c b_t) = v.

S2(aaw) = S2(aw)    S2(abw) = S2(bw)

S2(baw) = S2(aw)    S2(bbw) = S2(bw)

S2(caw) = S2(aw)    S2(cbw) = S3(bw)

S2(acw) = S2(cw)

S2(bcw) = S2(cw)

S2(ccw) = S2(cw)

S3:

S3(a a_t) = S3(b a_t) = v    S3(c a_t) = v'

S3(a b_t) = S3(b b_t) = S3(c b_t) = v.

S3(aaw) = S3(aw)    S3(abw) = S3(bw)

S3(baw) = S1(aw)    S3(bbw) = S3(bw)

S3(caw) = S2(aw)    S3(cbw) = S3(bw)

S3(acw) = S3(cw)

S3(bcw) = S3(cw)

S3(ccw) = S3(cw)

FIG. 3.5. NETWORK CONFIGURATION FOR SHOWING THAT N_min > 3m

Next we show that

E: (i) S3(bw) = S1(bw)

(ii) S3(cw) = S2(cw)

E is true when w is of length 1. This follows directly from the fact that
w is then either a_t or b_t and from the scenario specifications.

Assume E true for |w| ≤ l. Let w_l be a string in W of length l.

(a) From the scenario specifications, we have

S3(baw_l) = S1(aw_l) and
S1(baw_l) = S1(aw_l).

Therefore

S3(baw_l) = S1(baw_l).

(b) From the scenario specifications, we have

S3(bbw_l) = S3(bw_l) and
S1(bbw_l) = S1(bw_l).

By our inductive assumption,

S3(bw_l) = S1(bw_l).

Therefore

S3(bbw_l) = S1(bbw_l).

(c) From the scenario specifications, we have

S3(bcw_l) = S3(cw_l) and
S1(bcw_l) = S3(cw_l).

Therefore

S3(bcw_l) = S1(bcw_l).

From the above it follows that S3(bw) = S1(bw) for |w| ≤ l+1. Similarly
S3(cw) = S2(cw) for |w| ≤ l+1.

Thus E is true for |w| ≤ l+1, and therefore for all |w|.

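Let b_r and c_r be receivers in B and C respectively. Then S3_{b_r} = S1_{b_r}
and S3_{c_r} = S2_{c_r}.

By A1,

F_{b_r}(S3_{b_r}) = F_{b_r}(S1_{b_r}) = v and

F_{c_r}(S3_{c_r}) = F_{c_r}(S2_{c_r}) = v'.

By A2,

F_{b_r}(S3_{b_r}) = F_{c_r}(S3_{c_r})

implying v = v'. This contradicts our earlier assumption that v and v' are
distinct. Hence N_min = 3m+1.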


3.7.3. Implementing GBGA under MTS M3 without Authentication

The previous section shows that if GBGA is to be reached without using
authentication under the MTS

M3: F({T} ∪ {R}) ≤ r with all failure-free transmitters having the same value,

then, if t < r (where |T| = 2t+1), a minimum of 3r+1 sites is required. The
updated variable is replicated at 2r+1 sites. Hence a number of "phantom"
receivers equal to max(0, (3r+1) − (2r+1) − (2t+1)) = max(0, r − 2t − 1) will be
required. {R} is then the set of receivers which replicate the updated
variable plus the set of phantom receivers {R_p}. However, the procedure
given in the previous section for reaching GBGA using this configuration
requires r+1 phases (and hence, as mentioned there, its costs are not a
function of r − t).

As mentioned in Section 3.7.1, it is possible to construct GBGA algorithms
for MTS M3 using fewer phases. This is done by using BGA algorithms
in a different manner from that in the previous section. For the reasons
mentioned in Section 3.5, we choose the BGA algorithm BG1 described in
[PEA 80, LAM 80, DOL 81, DOL 82a] as the kernel of the GBGA algorithm
described below.

Consider a network consisting of a transmitter T and a set of receivers
{R} with |R| ≥ 3m. It is required to have the receivers reach BGA on the
value of T as long as the total number of malfunctioning sites in the network
is ≤ m. The algorithm BG1 is as follows:

Algorithm BG1({R}, m):

(1) The transmitter T sends its value to every receiver in {R}.

(2) If m > 0 then

(a) for every r ∈ {R}, let v_r be the value receiver r has obtained in step 1.
Receiver r acts as the transmitter in the algorithm BG1({R}-r, m-1) to
send the value v_r to every other receiver in {R}-r.

(b) for every r' ∈ {R} and each r ≠ r' in {R}, let v_r'(r) be the value receiver r'
receives from receiver r in step 2a. If no value is received, set v_r'(r) to D.
Let v_r'(r') be the value receiver r' has received from transmitter T in step
1. Receiver r' determines the value of the transmitter as
median{v_r'(z) | z ∈ {R}}.
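
The recursion in BG1 can be illustrated with a small simulation. The following is a minimal Python sketch under assumptions that are not part of the original text: sites are integers, values are comparable numbers, every message is delivered (so the default value D never arises), and a malfunctioning site is modeled by a hypothetical function in the `faulty` map that chooses what to send to each receiver. `median_low` is used so that the median is always one of the submitted values.

```python
from statistics import median_low

def send(sender, receiver, value, faulty):
    """Value delivered to `receiver`; a malfunctioning sender may send anything."""
    if sender in faulty:
        return faulty[sender](receiver, value)
    return value

def bg1(transmitter, receivers, m, value, faulty):
    """Return {receiver: value it attributes to `transmitter`}."""
    # Step 1: the transmitter sends its value to every receiver.
    got = {r: send(transmitter, r, value, faulty) for r in receivers}
    if m == 0:
        return got
    # Step 2a: each receiver r relays what it got, using BG1({R}-r, m-1).
    relayed = {
        r: bg1(r, [q for q in receivers if q != r], m - 1, got[r], faulty)
        for r in receivers
    }
    # Step 2b: each receiver takes the median of its own value and the relayed ones.
    decided = {}
    for r2 in receivers:
        values = [got[r2]] + [relayed[r][r2] for r in receivers if r != r2]
        decided[r2] = median_low(values)
    return decided

# One malfunctioning receiver (site 3), failure-free transmitter (site 0).
honest_tx = bg1(0, [1, 2, 3], 1, 5, {3: lambda recv, v: 100 * recv})
# Malfunctioning transmitter sending a different value to each receiver.
faulty_tx = bg1(0, [1, 2, 3], 1, 5, {0: lambda recv, v: recv})
```

In the first run the failure-free receivers 1 and 2 settle on the transmitter's value; in the second run all receivers settle on a common value even though the transmitter sent each of them something different.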

Now, we show how to use algorithm BG1 to implement GBGA under MTS M3
with fewer phases than r+1.

Consider a network with |T| = 2t+1 transmitters, with all failure-free
transmitters having the same value, and with at most r (r > t) malfunctioning
sites in the network. Let the remaining sites in the network {R} be of
number |R| ≥ 3r-t.

Algorithm  GBG1: 

(1) Each transmitter in {T} sends its value to a designated subset of
receivers {R_r}, |R_r| = 2r+1 ≤ |R|.

(2) Each receiver r_r in {R_r} computes the median of the (2t+1) values
received in step 1 to obtain v_{r_r}.

(3) Each receiver r_r in {R_r} broadcasts its value v_{r_r} to every other receiver
using BG1({R}-r_r, r-t-1).

(4) Let {X_r} be the set of 2r values received in step 3 and the single value
computed in step 2 by the receiver r in {R}. It computes the value of the
transmitters as median{X_r}.
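
The median computed in step 2 masks up to t malfunctioning transmitters whenever the failure-free transmitters form a majority: if at least t+1 of the 2t+1 values equal v, those values occupy the middle position of the sorted list. A minimal Python sketch of this one step follows; the function name `median_vote` and the sample values are illustrative assumptions only.

```python
from statistics import median_low

def median_vote(values):
    """Median of an odd-length list of comparable values (2t+1 of them)."""
    if len(values) % 2 == 0:
        raise ValueError("expected an odd number (2t+1) of values")
    return median_low(values)

t = 2
v = 7
# t+1 failure-free transmitters send v; t malfunctioning ones send arbitrary values.
received = [v, 999, v, -100, v]
result = median_vote(received)
```

The outcome does not depend on whether the arbitrary values lie above or below v, only on the failure-free values being a majority.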

Thm 3.1: Algorithm GBG1 provides GBGA under MTS M3 in r-t+1 phases.
Proof: Steps 1 and 2 together involve one phase of communication. Steps 3
and 4 involve (r-t-1)+1 or (r-t) phases. Hence GBG1 involves r-t+1
phases in all.

Case 1: A majority of failure-free transmitters does not exist. Then GBGA
requires each failure-free receiver to arrive at the same final value. Since
there are at least t+1 malfunctioning transmitters, there are at most r-t-1
malfunctioning receivers. From the correctness of BG1, it follows that every
failure-free receiver r has the same set of values {X_r} in step 4 and hence
unanimity is reached.

Case 2: A majority of failure-free transmitters exists and v is their common
value. Then GBGA requires each failure-free receiver to arrive at the final
value of v. Let {CR_r} be the set of failure-free receivers in {R_r}. |CR_r| ≥ r+1
since |R_r| = 2r+1 and there are at most r malfunctioning sites in the network.

Each site cr_r in {CR_r} computes the value v in step 2. We claim that v
will be the value received from cr_r by every failure-free receiver in step 3.
Then the final value computed in step 4 will be v.

The proof of our claim is based on a lemma given in [DOL 82a] for the
BGA algorithm BG1. This lemma states that, in a network with a single
transmitter T and a set of receivers {R'} with at most m malfunctioning sites,
BG1({R'}, x) provides BGA if the transmitter is failure-free and |R'| ≥ 2m+x.

In step 3 of GBG1, as executed by cr_r, we have cr_r as a failure-free
transmitter, executing BG1 with x = r-t-1. The set of receivers it is transmitting
to has cardinality 3r-t-1 = 2r+(r-t-1). Hence our claim is proved.

For completeness, we give below the proof of the above-mentioned
lemma. Consider a network consisting of a failure-free transmitter T with a
value v', and a set of receivers {R'}, |R'| ≥ 2m+x, with at most m malfunctioning
sites. To prove that BG1({R'}, x) produces BGA, we use induction on the
value of x. If x=0, the final value arrived at by each failure-free receiver is
the value received in step 1 of BG1 from the transmitter, namely v'. Hence
BGA is reached for x=0.

Assume the lemma holds for x=k (k ≥ 0). Consider x=k+1. In step 1, each
failure-free receiver r' receives the value v'. In step 2a, it applies the algorithm
BG1({R'}-r', k) to send the value v' to all the receivers in the set
{R'}-r', which contains at least 2m+k sites, and hence the induction
hypothesis implies that every other failure-free receiver obtains from r' the
value v'. The set {R'} contains at least 2m+k+1 sites. Since k ≥ 0, and there
are at most m malfunctioning sites in {R'}, the failure-free receivers form a
majority of {R'}, and so every failure-free receiver computes
a final value v' in step 2b. This proves the lemma.

Using the algorithm GBG1, we can implement GBGA with MTS M3 in r-t+1
phases and with (3r-t)-(2r+1) = r-t-1 phantom receivers. Thus, this algorithm
reduces the communication overhead and number of phases to a function
of the difference in the degrees of (physical) replication of the read-
and write-variables. As mentioned in Section 3.7.1, this is a desirable property.
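
As a worked check of these counts, the following minimal Python sketch evaluates the expressions stated in this section for given r and t. The function name `gbg1_costs` and the returned field names are illustrative assumptions; the formulas are those quoted in the text (|R| ≥ 3r-t receivers, 2r+1 replicas, r-t+1 phases).

```python
def gbg1_costs(r, t):
    """Phase and receiver counts for GBG1, using the expressions in the text."""
    if not 0 <= t < r:
        raise ValueError("the construction assumes 0 <= t < r")
    receivers_needed = 3 * r - t          # |R| >= 3r - t
    replicas = 2 * r + 1                  # receivers replicating the variable
    return {
        "phases": r - t + 1,
        "receivers": receivers_needed,
        "phantom_receivers": receivers_needed - replicas,   # equals r - t - 1
        "phases_previous_procedure": r + 1,
    }

costs = gbg1_costs(r=3, t=1)
```

For r=3 and t=1 this gives 3 phases instead of 4 and a single phantom receiver.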

3.7.4.  Implementing  GBGA  under  MTS  M3  using  authentication 

A similar reduction in the number of phases can be realized in constructing
an algorithm for reaching GBGA under MTS M3 using authentication.
The use of authentication sharply reduces the number of messages
needed. Our algorithm uses as its kernel a BGA algorithm BG2 suggested in
[DOL 82c] that uses authentication.

Consider a network consisting of a transmitter T and a set of receivers
{R} which has at most m malfunctioning sites.

Algorithm  BG2: 

(1)  The  transmitter  T  signs  and  sends  its  value  to  all  receivers. 

(2) Each receiver waits for the receipt of messages. If during phase k, a
receiver receives a message containing value v and signed by k distinct sites
(beginning with the transmitter), then the receiver inserts v in its list of
received values if not already in it and if the list does not already contain two
values. If the value v gets inserted, and k < m+1, then the receiver signs the
message and sends it, in phase k+1, to all receivers whose signatures are not
in the message.

(3)  After  phase  m  +  1,  if  a  receiver  has  exactly  one  member  in  its  list  of 
received  values,  then  that  is  chosen  as  the  final  value,  otherwise  it  agrees  on 
a  default  value. 
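
The receiver-side rule of BG2 can be sketched as follows. This is a minimal Python illustration under assumptions not found in the original: a signature chain is modeled as a tuple of site ids (no real cryptography), the class and method names are hypothetical, and `DEFAULT` stands in for the default value.

```python
DEFAULT = None  # stand-in for the default value agreed on in step 3

class Receiver:
    def __init__(self, site_id, m, transmitter):
        self.site_id = site_id
        self.m = m                      # maximum number of malfunctioning sites
        self.transmitter = transmitter
        self.values = []                # list of received values, at most two

    def on_message(self, phase, value, signers):
        """Step 2: return the signer chain to relay in phase+1, or None."""
        valid = (
            len(signers) == phase
            and len(set(signers)) == len(signers)    # k distinct signers
            and signers[0] == self.transmitter       # beginning with the transmitter
        )
        if not valid or value in self.values or len(self.values) >= 2:
            return None
        self.values.append(value)
        if phase < self.m + 1:
            return signers + (self.site_id,)         # sign and relay
        return None

    def decide(self):
        """Step 3: a single received value is final; otherwise use the default."""
        return self.values[0] if len(self.values) == 1 else DEFAULT

r = Receiver(site_id=2, m=1, transmitter=0)
relay = r.on_message(1, "a", (0,))      # accepted in phase 1 and relayed
first_decision = r.decide()
r.on_message(2, "b", (0, 3))            # a second, conflicting value arrives
second_decision = r.decide()
```

Capping the list at two values is what keeps the message count low: two distinct signed values already prove that the transmitter malfunctioned, so further values need not be relayed.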

This algorithm requires O(N^2) messages, where N is the total number of
receivers.

We use BG2 to construct an algorithm GBG2 providing GBGA under MTS
M3. Consider a network consisting of a set of transmitters {T}, |T| = 2t+1,
and a set of receivers {R}, |R| = 2r+1, t < r, and all the
failure-free transmitters have the same value.

Algorithm  GBG2: 

(1)  Each  transmitter  sends  its  value  to  all  receivers. 

(2) Each receiver computes the median of the 2t+1 values received.

(3) Each receiver acts as transmitter of the value computed in step 2 to all
other receivers using algorithm BG2 parameterized for a maximum of r-t-1
malfunctioning receivers.

(4)  Each  receiver  computes  the  median  of  the  2r  values  arrived  at  in  step  3 
and  the  value  computed  in  step  2.  This  median  is  the  final  value  agreed  on. 

Thm. 3.2: Algorithm GBG2 provides GBGA under MTS M3 in r-t+1 phases.

Proof: Steps 1 and 2 contribute one phase and steps 3 and 4 contribute r-t
phases. Hence a total of r-t+1 phases is required.

Case 1: A majority of malfunctioning transmitters exists. Therefore there
are at most r-t-1 malfunctioning receivers. Hence, by the correctness of
BG2, each failure-free receiver has the same set of 2r+1 values whose
median it computes in step 4. Therefore, all failure-free receivers unanimously
agree on some value.

Case  2:  A  majority  of  failure-free  transmitters  exists  and  they  have  the  com¬ 
mon  value  v.  In  step  2,  each  failure-free  receiver  r  computes  the  value  v, 
and  in  the  first  phase  of  step  3,  sends  this  value,  signed,  to  all  other 
receivers. Moreover, since this is the only value it signs as transmitter in
step 3, each failure-free receiver will agree on v as the value transmitted by
r  in  step  3.  Since  a  majority  of  receivers  is  failure-free,  each  receiver  will 
compute  v  as  the  final  value  in  step  4. 


For this algorithm again, the communication overhead and the number
of phases are a function of the difference in the degrees of replication of the
read- and write-variables.

3.8.  Conclusion

In  this  chapter,  we  introduced  the  correctness  aspect  of  the  availability 
attribute  of  global  information.  Correctness  becomes  important  when  toler¬ 
ance  to  malfunctioning  sites  is  required.  We  advanced  some  theoretical  con¬ 
siderations  for  dealing  with  this  mode  of  failure.  It  is  clear  from  the  above 
discussion that the GBGA can be used in a flexible manner to obtain the
required  degree  of  tolerance  of  malfunctions.  GBGA  protocols  can  also  be 
mixed  with  cheaper  protocols  which  provide  a  lower  degree  of  tolerance  in 
such  a  way  that  meaningful  guarantees  can  be  given  as  to  the  correctness  of 
the  global  information. 

It  is  evident  that  malfunctions  are  expensive  to  cope  with.  But  the  mal¬ 
function  as  a  model  of  failure  is  appealing  since  it  represents  the  worst  kind 
of  faulty  behavior  a  site  can  exhibit.  It  would  require  very  complex  reasoning 
to  show  the  probability  of  hardware  or  software  failures  which  could  result  in 
a site malfunction to be small enough to ignore. Further, even if such reasoning
could be provided, a site could be taken over by a malicious agent. For
these  reasons,  we  believe  that  techniques  must  be  developed  to  deal  with 
malfunctions.  One  possible  approach  around  the  difficulty  of  the  high  costs 
of  such  techniques  is  to  attempt  to  use  them  sparingly  and  selectively  and  we 
have  explored  this  approach  in  this  chapter. 

In  our  approach,  the  information  is  selectively  replicated  to  different 
degrees,  depending  on  its  criticality.  If,  in  spite  of  this  replication,  the 
correct  value  of  some  of  this  information  becomes  unavailable  due  to  the 
malfunctioning of a number of sites, we try to prevent updates from propagating
error to information whose correct value is still available. A combination
of an acceptability-checking step in which assertions are used to weed
out wrong values of the read-variable, and a unanimity-reaching step (whose
requirements are formalized in the GBGA) which provides consensus on the
value of the read-variable, is used for this purpose. We presented a variety of
protocols  which  achieve  GBGA  for  different  kinds  of  bounds  on  the  number  of 
malfunctioning  sites.  These  protocols  differ  in  the  amount  of  malfunction- 
tolerance  they  provide  and  in  their  associated  costs,  and  thus  allow  the 
designer  to  make  appropriate  tradeoffs. 


CHAPTER  4 


DEADLOCK  DETECTION  IN  DISTRIBUTED  DATABASE  SYSTEMS 

4.1.  Introduction 

In  this  chapter,  we  present  centralized  and  distributed  algorithms  for 
deadlock  detection  in  distributed  database  systems.  These  algorithms  use  a 
clock  facility  to  ensure  that  deadlocks  indicated  really  exist  and  that  no 
existing  deadlocks  go  undetected. 

In Section 4.2, the various approaches available for deadlock handling
are  discussed.  In  Section  4.3,  race  conditions  that  complicate  deadlock 
detection  in  distributed  systems  are  discussed.  Section  4.4  introduces  ter¬ 
minology  and  lists  some  assumptions.  In  Sections  4.5  and  4.6,  centralized 
and  distributed  schemes  for  deadlock  detection  are  presented,  along  with 
past  work  in  the  area. 

4.2.  Approaches  to  Deadlock  Handling 

Deadlock  may  be  described  as  a  situation  of  mutual  wait  among  a  set  of 
blocked  processes,  each  of  which  is  waiting  to  acquire  one  or  more  resources 
held by other processes in the set. The easiest approach to deal with
deadlock is to use timeouts to abort any process that has been waiting too
long (or to abort the process that has been causing another to wait too
long). Though this approach is feasible for lightly loaded systems in which
contention is rare, it runs into difficulties in congested situations. At such
times, timers will run out often, causing many processes to be aborted and
prolonging the congestion [GRA 78]. Other deficiencies of this approach are
cyclic restart or livelock [ISL 80], and wastage of resources arising from
aborted computations.

Three  approaches  have  been  developed  to  deal  with  the  deadlock  prob¬ 
lem:  prevention,  avoidance  and  detection. 

In  deadlock  prevention  techniques,  the  requests  for  resources  are  con¬ 
strained  to  occur  in  particular  ways  so  that  deadlocks  never  occur.  Such 
techniques  include  requesting  all  resources  needed  by  the  process  at  once, 
imposing  a  total  ordering  on  the  resources  and  requesting  needed  resources 
in this order, and the WOUND-WAIT and WAIT-DIE algorithms of [ROS 77]. These
techniques  restrict  the  amount  of  concurrency  as  a  result  of  the  constraints 
they  impose.  Further,  the  first  two  approaches  are  not  appropriate  for  data¬ 
base  systems,  since  it  is  not  always  possible  to  predict  ahead  of  time  which 
resources  will  be  needed  by  a  process. 
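
The two timestamp-based rules named above can be stated compactly. The sketch below is a minimal Python rendering of the commonly described WAIT-DIE and WOUND-WAIT decisions, assuming that each transaction carries a start timestamp and that a smaller timestamp means an older transaction; the function names and return strings are illustrative only.

```python
def wait_die(requester_ts, holder_ts):
    """WAIT-DIE: an older requester waits; a younger requester is aborted."""
    return "wait" if requester_ts < holder_ts else "abort requester"

def wound_wait(requester_ts, holder_ts):
    """WOUND-WAIT: an older requester preempts the holder; a younger one waits."""
    return "abort holder" if requester_ts < holder_ts else "wait"

# Transaction with timestamp 10 is older than the one with timestamp 20.
older_requests = (wait_die(10, 20), wound_wait(10, 20))
younger_requests = (wait_die(20, 10), wound_wait(20, 10))
```

In both rules waiting is permitted in only one direction of the age ordering, so a cycle of waits cannot form.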

Deadlock  avoidance  techniques  permit  the  granting  of  a  requested 
resource  only  if  to  do  so  would  still  allow  all  processes  at  least  one  way  to 
complete  execution.  Habermann’s  algorithm  [HAB  66]  is  the  best-known 
deadlock  avoidance  scheme.  Since  a  worst  case  scenario  for  future  requests 
is  assumed  in  determining  if  a  resource  grant  is  safe,  concurrency  is  still 
restricted.  The  deficiency  of  having  to  know  ahead  of  time  which  resources 
are  going  to  be  required  is  also  present  in  avoidance  schemes.  Further,  in  a 
distributed system, the computation of whether a resource grant is safe or
not requires knowledge of the states of the various processes at the various
sites  in  the  system.  Hence  it  is  difficult  to  do  resource  allocation  in  an 
efficient  yet  decentralized  manner. 
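
The notion of a safe grant can be made concrete with a safety test of the banker's-algorithm kind. The following Python sketch is an illustration only and is not claimed to be the exact formulation of [HAB 66]; `available`, `allocation` and `max_need` are assumed inputs, with `max_need` playing the role of the advance knowledge of resource requirements mentioned above.

```python
def is_safe(available, allocation, max_need):
    """True if some order lets every process obtain its maximum need and finish."""
    work = list(available)
    unfinished = set(range(len(allocation)))
    progressed = True
    while unfinished and progressed:
        progressed = False
        for p in sorted(unfinished):
            need = [mx - al for mx, al in zip(max_need[p], allocation[p])]
            if all(n <= w for n, w in zip(need, work)):
                # p can run to completion and then returns what it holds.
                work = [w + al for w, al in zip(work, allocation[p])]
                unfinished.remove(p)
                progressed = True
                break
    return not unfinished

# One resource type; three processes holding 5, 2 and 2 units.
safe_state = is_safe([3], [[5], [2], [2]], [[10], [4], [9]])
unsafe_state = is_safe([2], [[5], [2], [3]], [[10], [4], [9]])
```

A request would be granted only if the state after the grant still passes this test, which is why the worst-case assumption about future requests limits concurrency.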

Deadlock  detection  techniques  allow  a  maximum  of  concurrency  by 
granting resources whenever they are available. At appropriate times, the
status of resources and processes in the system is examined to see if a
deadlock exists. In order to do so, this status is maintained in the form of a
graph in which the nodes represent processes and resources, the
process-to-resource arcs represent outstanding requests and the resource-to-process
arcs represent possession of resources by processes. The necessary and
sufficient conditions for deadlock in systems containing reusable resources,
e.g., files, memory, etc. and/or consumable resources, e.g., messages, have
been  developed  in  [HOL  72].  In  the  case  of  distributed  databases,  under  the 
assumption  that  all  resources  are  one-of-a-kind  and  that  a  process  must  wait 
till  all  resources  it  has  requested  have  been  granted  before  it  can  proceed, 
the  necessary  and  sufficient  condition  is  the  existence  of  a  cycle  in  the 
process-resource  graph. 
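
Under the one-of-a-kind assumption, checking for deadlock therefore reduces to finding a cycle in the process-resource graph. A minimal Python sketch using depth-first search follows; the adjacency-dictionary representation and the node names are assumptions made for the example.

```python
def find_cycle(graph):
    """Return one cycle as a list of nodes, or None if the graph is acyclic."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {}
    path = []

    def visit(node):
        colour[node] = GREY
        path.append(node)
        for nxt in graph.get(node, ()):
            state = colour.get(nxt, WHITE)
            if state == GREY:                     # back arc closes a cycle
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        path.pop()
        colour[node] = BLACK
        return None

    for node in list(graph):
        if colour.get(node, WHITE) == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None

# Process -> resource arcs are requests; resource -> process arcs are possession.
deadlocked = find_cycle({"P1": ["R2"], "R2": ["P2"], "P2": ["R1"], "R1": ["P1"]})
no_deadlock = find_cycle({"P1": ["R2"], "R2": ["P2"]})
```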

Deadlock  detection  in  a  distributed  system  may  be  done  in  a  central¬ 
ized,  hierarchical  or  distributed  manner.  In  centralized  detection,  a  single 
site  is  designated  as  the  deadlock  detector.  It  collects  the  status  of 
processes  and  resources  in  the  system  and  checks  for  deadlocks  in  the 
assembled  graph.  (  Detection  of  deadlocks  confined  to  one  site  may  be  done 
locally.  )  The  disadvantage  of  this  method  is  its  vulnerability  to  failure  of  the 
deadlock-detecting  site.  Also,  if  the  network  is  large,  the  load  imposed  on 
the  deadlock-detecting  site  may  be  too  large. 

Both  of  the  above  problems  are  ameliorated  in  the  hierarchical  method. 
Here,  the  sites  are  partitioned  into  a  hierarchy  of  clusters,  with  each  cluster 
having  a  deadlock  detector  site.  A  deadlock  confined  to  sites  within  a  cluster 
is  detected  by  the  local  deadlock  detector;  a  deadlock  spanning  multiple 
clusters  is  detected  by  the  deadlock  detector  in  the  lowest  cluster  which  is  a 
parent of all the clusters involved. Here, as in the centralized case, detection
of a deadlock may be delayed by the failure of sites other than the
deadlocked  sites.  There  is  also  the  problem  of  choosing  the  clusters 
appropriately  in  order  that  most  of  the  deadlock  computation  may  be  done 
at  the  local  cluster  level  instead  of  having  to  refer  to  higher  levels  in  the 
hierarchy. 

In  the  distributed  scheme,  the  deadlock  detecting  facility  is  distributed 
equally  among  all  the  sites  in  the  network.  In  general,  distributed  schemes 
involve more communication overhead than the centralized schemes. This
happens because, in distributed schemes, graph traversals initiated at
different points in the process-resource graph in order to check for
deadlocks go over the same portions of the graph. This repetition is to some
extent unavoidable in a distributed algorithm. The advantage of distribution
is that the detection of a deadlock involves only the sites involved in the
deadlock. Hence, the vulnerability to failures of the designated deadlock
detecting  sites  which  characterizes  the  centralized  and  hierarchical  schemes 
is  not  present  here. 

4.3.  Race  Conditions  in  Deadlock  Detection 

In  a  single  computer  system,  the  deadlock  detector  can  stop  all  activity 
in  the  computer,  while  it  examines  the  necessary  tables,  queues,  etc.,  to  con¬ 
struct  the  process-resource  graph.  In  the  case  of  a  distributed  system,  it  is 
not  feasible  to  stop  the  entire  system  in  order  to  take  a  similar  snapshot  of 
the  processes  and  resources  in  the  system  and  the  messages  that  may  be  in 
transit.  Therefore  the  status  of  the  processes  and  resources  at  each  site 
must be recorded asynchronously and the global status computed in a consistent
manner from these recordings. A complicating factor here is that
messages  may  take  arbitrary  periods  of  time  to  reach  their  destinations. 


As an illustration of the problems involved, consider two sites S1 and S2
of a network in which a third site DD acts as a centralized deadlock detector.
Suppose process P1 and resource R1 reside at site S1 and process P2 and
resource R2 reside at site S2 (Fig. 4.1). Initially P1 and P2 request and
acquire resources R1 and R2 respectively. Next P2 sends a message to S1
requesting resource R1, and gets blocked. At this point the resource controller
at site S1 reports to DD indicating P1 to be in possession of R1, and
P2 to be in wait for it.

Next, P1 releases R1 and requests R2. The corresponding request message,
arriving at S2, causes P1 to get blocked. The resource controller at
S2 now reports to DD indicating P2 to be in possession of R2 and P1 to be
waiting for it. On putting these two reports together, DD detects a cyclic
wait P2→R1→P1→R2→P2 and may detect a deadlock unless its algorithm
takes other steps to verify that a cycle really exists. In this case the cycle
does not exist, since P1 is no longer in possession of R1.
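
The scenario can be replayed in a few lines. The following Python sketch is illustrative only: arcs are modeled as (from, to) pairs, the variable names are assumptions, and the "current" state models just the one fact stated in the text, namely that the possession arc R1→P1 has disappeared.

```python
# Report from S1, sent while P1 still held R1 and P2 was waiting for it.
report_s1 = {("R1", "P1"), ("P2", "R1")}
# Report from S2, sent after P1 released R1 and requested R2.
report_s2 = {("R2", "P2"), ("P1", "R2")}

# DD simply merges the reports, which were taken at different times.
assembled = report_s1 | report_s2
cycle = [("P2", "R1"), ("R1", "P1"), ("P1", "R2"), ("R2", "P2")]
phantom = all(arc in assembled for arc in cycle)

# State when the second report is produced: the arc R1 -> P1 is stale.
current = assembled - {("R1", "P1")}
still_present = all(arc in current for arc in cycle)
```

The merged graph contains every arc of the cycle although no single instant of the real system did, which is precisely the false indication described above.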

Another  danger  is  that  a  deadlock  detection  algorithm  may  fail  to  detect 
a deadlock that really exists. Typically this happens when an algorithm fails
to  take  into  account  that  messages  may  arrive  after  indefinite  delays.  As  a 
result,  all  the  deadlock  computations  that  arise  as  a  result  of  a  sequence  of 
events  causing  a  deadlock,  may  operate  on  incomplete  information,  and  thus 
the  deadlock  goes  undetected. 

4.4.  Terminology  and  Assumptions 

The database is accessed and updated through transactions. A transaction
consists of one or more processes, called agents. An agent may request
to acquire either of two kinds of resources: reusable resources (which will be
referred to simply as resources from now on) and consumable resources
(which will be referred to as messages from now on). Messages are used by
agents of a transaction to co-ordinate their work. In the case of resources, it
is assumed that any agents currently in possession of a resource must all
relinquish it before any of the agents currently waiting for the resource can
gain possession of it. This assumption is necessary to make existence of a
cycle a sufficient condition for deadlock.

A transaction agent may be in one of two states: active or waiting. Initially
it is in active state. It may enter waiting state if:

(i) it wishes to receive messages from each of a set of agents belonging to the
same transaction. For example, the agent co-ordinating the commit processing
for the transaction may enter waiting state and remain there till it
has received prepared-to-commit messages from all other agents of the
transaction.

(ii) it wishes to acquire each of a set of resources (in specific modes (e.g.
shared, exclusive)).

When  all  the  messages  or  resources  have  been  received,  the  agent  re¬ 
enters  active  state. 

4.5.  Centralized Algorithms

Early  work  in  centralized  deadlock  detection  [GRA  78,  GOL  77]  does  not 
correctly  solve  the  problem  of  race  conditions.  The  algorithm  of  [GRA  78] 
works under the assumption of two-phase usage of resources (explained
below). However, if this assumption is made, an algorithm which is much
more efficient can be constructed as shown later. Similarly, a timing problem
in the centralized algorithm of [GOL 77] was shown in [SUN 78].

In  [HO  79]  two  algorithms  were  proposed  to  address  the  problem  of  race 
conditions.  The  first  is  a  two-phase  algorithm  in  which  first  one  set  of  reports 
is collected from all sites and then another set is collected. Only the information
common to the two sets of reports is assembled to check for global
deadlocks  and  it  is  shown  that  spurious  indications  are  thereby  avoided.  In 
the  second  algorithm,  each  inter-site  arc  is  replicated  at  both  the  sites 
involved  and  a  deadlock  is  detected  after  only  one  set  of  reports  is  received. 
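
The idea behind the first of these algorithms, keeping only what both rounds of reports agree on, can be sketched as a set intersection. This Python fragment illustrates that single idea and not the algorithm of [HO 79] itself, whose details are not given here; the report format, the function name and the sample arcs are assumptions.

```python
def stable_arcs(first_round, second_round):
    """Arcs that appear in both rounds of reports, taken over all sites."""
    first = set().union(*first_round.values())
    second = set().union(*second_round.values())
    return first & second

# Hypothetical reports for a situation like that of Section 4.3.
round_1 = {"S1": {("R1", "P1"), ("P2", "R1")}, "S2": {("R2", "P2")}}
round_2 = {"S1": {("R1", "P2")}, "S2": {("R2", "P2"), ("P1", "R2")}}
stable = stable_arcs(round_1, round_2)
```

Arcs that appeared or vanished between the two rounds are discarded, so a cycle found among the remaining arcs is built only from arcs that persisted across both rounds.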

However,  these  algorithms  require  that  all  sites  in  the  network  which 
access  the  resources  as  well  as  the  sites  controlling  the  resources  should 
report  to  the  deadlock  detector.  Typically,  the  number  of  sites  controlling 
the  resources  will  be  much  smaller  than  the  number  of  sites  accessing  the 
resources.  For  example,  in  a  network  running  distributed  INGRES  [STO  79], 
the  control  is  done  only  from  the  primary  sites  in  the  network.  In  the  next 
two  sections,  we  show  under  what  conditions  we  can  construct  algorithms 
which  utilize  reports  from  only  the  resource-controlling  sites. 

4.5.1.  Detection  under  Conditions  of  2-Phase  Resource  Usage 

By  2-phase  usage  [ESW  76]  of  resources,  we  mean  that  the  execution  of  a 
transaction  can  be  divided  into  two  distinct  phases:  a  growing  phase  and  a 
shrinking  phase,  the  latter  following  the  former.  In  the  growing  phase, 
resources  are  acquired  but  not  released.  In  the  shrinking  phase,  resources 
are  released  but  not  acquired.  The  implication  is  that,  under  this  discipline, 
a  transaction  does  not  release  any  resources  until  after  it  has  acquired  all 
the  resources  it  needs. 
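
The discipline can be enforced mechanically by refusing any acquisition once the first release has occurred. A minimal Python sketch follows; the class name and its methods are hypothetical and stand for whatever locking interface a system provides.

```python
class TwoPhaseTransaction:
    """Tracks held resources and enforces growing-then-shrinking usage."""

    def __init__(self, name):
        self.name = name
        self.held = set()
        self.shrinking = False          # becomes True at the first release

    def acquire(self, resource):
        if self.shrinking:
            raise RuntimeError(f"{self.name}: acquire after release violates 2-phase usage")
        self.held.add(resource)

    def release(self, resource):
        self.shrinking = True
        self.held.discard(resource)

txn = TwoPhaseTransaction("T1")
txn.acquire("R1")
txn.acquire("R2")
txn.release("R1")                       # the shrinking phase begins here
try:
    txn.acquire("R3")
    violation_detected = False
except RuntimeError:
    violation_detected = True
```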

Suppose  that  only  the  resource  controllers  send  reports  to  the  deadlock 
detector,  giving  for  each  resource  the  list  of  transactions  in  possession,  and 
the  list  of  transactions  in  wait.  The  identity  of  the  specific  agent  of  the  tran¬ 
saction  which  is  in  possession  or  waiting  is  not  given.  Periodically,  the 
deadlock  detector  takes  the  latest  report  from  each  controller  and 


Thm  4.1:  Suppose  a  cycle  is  found  in  the  transaction-resource  graph  created 
by  the  above  procedure.  Then  the  transactions  in  the  cycle  are  deadlocked. 

Proof: Let the cycle take the form shown in Fig. 4.2.

For each resource node in the graph, the incoming arcs represent transaction
agents waiting to acquire the resource and the arcs running out of
the node represent transactions in possession of the resource. By our earlier
assumption that all requested resources must be acquired before the agent making
the  request  can  proceed,  none  of  the  arcs  representing  a  wait  for  the 
resource  can  vanish  before  all  the  arcs  representing  possession  of  the 
resource  vanish. 

For  each  transaction  node,  the  incoming  arcs  represent  resources  in 
possession  and  the  outgoing  arcs  represent  resources  the  transaction  is  wait¬ 
ing  for.  Since  each  transaction  uses  resources  in  a  2-phase  manner,  no 
incoming  arc  at  a  transaction  node  can  vanish  before  all  its  outgoing  arcs 
vanish. 

Let x < y indicate that arc y can vanish only after arc x vanishes. Applying
the arguments given above to the incoming and outgoing arcs at the
nodes R1, T2, R2, T3, ..., TN, RN we get a_1 < a_2 < a_3 < ... < a_{2N-2} < a_{2N-1} < a_{2N}, i.e.
a_1 < a_{2N}. But applying them to the incoming and outgoing arcs at node T1, it
follows that a_{2N} < a_1. This contradiction implies that no arc in the cycle can
vanish (unless one of the transactions in the cycle is aborted). Hence the
transactions in the cycle are deadlocked.


Further,  every  genuine  deadlock  will  be  detected,  since  none  of  the  arcs 
in the deadlock cycle will vanish till the deadlock is broken and thus the
cycle  must  eventually  appear  in  the  deadlock  detector's  assembled  graph. 

Thus,  we  have  shown  that  when  resource  usage  is  2-phase,  status  reports 
need  be  collected  only  from  the  resource-controllers.  All  algorithms  for  cen¬ 
tralized  detection  published  hitherto  have  involved  gathering  reports  from 
all sites in the network. Two-phase usage of resources is used in many systems
to satisfy the requirement of serializability of transaction execution
histories [ESW 76]. Therefore the detection procedure described above can
be utilized in these systems, e.g. distributed INGRES.
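
The detection procedure of this section can be sketched as follows: build the transaction-resource graph from the controllers' reports and look for nodes that cannot be reduced away. This Python sketch makes several assumptions not stated in the text: reports are given as a dictionary mapping each resource to its (holders, waiters) lists, transaction and resource names are distinct, and graph reduction is used in place of an explicit cycle search. The result is the set of transactions that lie on a cycle or wait, directly or indirectly, on one.

```python
def blocked_by_cycle(reports):
    """reports: {resource: (holders, waiters)} -> transactions that cannot proceed."""
    arcs = {}
    for resource, (holders, waiters) in reports.items():
        arcs.setdefault(resource, set()).update(holders)    # possession arcs R -> T
        for txn in holders:
            arcs.setdefault(txn, set())
        for txn in waiters:
            arcs.setdefault(txn, set()).add(resource)       # wait arcs T -> R
    remaining = set(arcs)
    changed = True
    while changed:
        changed = False
        for node in list(remaining):
            if not (arcs[node] & remaining):    # nothing left that it depends on
                remaining.discard(node)
                changed = True
    return {node for node in remaining if node not in reports}

deadlock = blocked_by_cycle({"R1": (["T1"], ["T2"]), "R2": (["T2"], ["T1"])})
no_deadlock = blocked_by_cycle({"R1": (["T1"], ["T2"]), "R2": (["T2"], [])})
```

Only resource controllers contribute input here, which matches the point of the section: under 2-phase usage no reports are needed from the sites that merely access resources.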

4.5.2.  Detection under Conditions of non-2-phase Resource Usage

In  some  database  systems,  resource  usage  is  not  constrained  to  be 
necessarily 2-phase. For example, System R [AST 76] allows three different
degrees  of  data  consistency  from  which  the  user  may  specify  one  for  his 
transaction.  The  highest  degree  of  consistency  is  the  one  corresponding  to 
2-phase  resource  usage.  The  advantage  of  using  lower  degrees  of  consistency 
is  less  lock  contention. 

When  the  resource  usage  is  not  necessarily  2-phase,  the  possibility  of 
false indication of deadlock in situations such as the one illustrated in Fig.
4.1 arises. For the most general case, where the agents of a transaction
may execute in parallel, an algorithm such as those described in [HO 79,
GRA 78], in which all sites in the network have to send status reports to the
deadlock detector, is necessary. However, there is one transaction model in
which intra-transaction concurrency is not present. This is the "migrating"
transaction model [ROS 77, GRA 81], used in System R*, the distributed version
of System R. Here, a transaction starts at one site and moves from site
to site as necessary to access remote resources. At every site visited by the
transaction,  there  is  a  single  agent  that  does  the  work  at  that  site  on  behalf 
of  that  transaction.  The  agent  of  the  transaction  at  the  site  that  the  transac¬ 
tion  is  currently  visiting  is  called  the  front  of  the  transaction.  A  list  of 
unreleased  resources  is  carried  along  by  the  transaction  front  and  messages 
are  sent  releasing  them  when  the  transaction  terminates.  (  Acquired 
resources  may  be  released  prior  to  transaction  termination,  for  example  if 
the  highest  degree  of  data  consistency  is  not  desired  for  the  transaction.  )  It 
is  assumed  that  the  transaction  front  does  not  migrate  from  a  site  before  it 
has  acquired  the  resources  it  has  requested  while  at  that  site. 

A global clock facility fulfilling Lamport's clock rules [LAM 78a],
mentioned in Section 2.2.2 of Chapter 2, is assumed to exist. Timestamps are
assigned to resource requests using this facility. Uniqueness of timestamps
is assured by taking the clock reading to include the site id as its less
significant part. These rules imply that

(i)  given  two  requests  issued  by  a  transaction  front  while  at  a  given  site,  the 
timestamp  assigned  to  the  later  request  is  greater  than  that  assigned  to  the 
earlier  request,  and 

(ii)  if  the  transaction  front  migrates  from  site  a  to  site  b,  then  timestamps 
associated  with  requests  issued  at  b  are  greater  than  those  associated  with 
requests  issued  by  the  front  when  at  a. 
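
The timestamp scheme described above can be sketched as follows. This is only a minimal illustration, assuming a counter-based logical clock that is advanced whenever a timestamp arrives from elsewhere and a fixed-width site-id field; the names SiteClock, SITE_BITS, on_arrival and orig are hypothetical and are not taken from the text.

```python
SITE_BITS = 16  # assumed width of the site-id field (hypothetical)


class SiteClock:
    """Logical clock for one site; readings carry the site id in the low bits."""

    def __init__(self, site_id):
        assert 0 <= site_id < (1 << SITE_BITS)
        self.site_id = site_id
        self.counter = 0

    def next_timestamp(self):
        # Successive readings at one site strictly increase (property (i)).
        self.counter += 1
        return (self.counter << SITE_BITS) | self.site_id

    def on_arrival(self, timestamp):
        # When a transaction front arrives carrying a timestamp, move the
        # local counter up to it so that later readings exceed it (property (ii)).
        self.counter = max(self.counter, timestamp >> SITE_BITS)


def orig(timestamp):
    """Recover the issuing site from the less significant part of a timestamp."""
    return timestamp & ((1 << SITE_BITS) - 1)
```

Because the site id occupies the low-order bits, two sites can never issue the same timestamp, and the issuing site can be recovered from the timestamp alone.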

Each request for a resource sent to a resource controller by a transaction
front is accompanied by the timestamp assigned to the request. This
timestamp is retained by the controller until it is informed about the release
of the resource by the transaction. The resource may be released by the
front at a site other than the one where it was requested and acquired.


Further,  as  mentioned  before,  the  transaction  front  maintains  a  list  of 
resources  which  it  has  acquired  but  not  released  and  the  corresponding 
request  timestamps. 

From time to time, a site may receive a confirm_ownership message
from the deadlock detector. This message specifies a transaction T, a
resource R and a timestamp t. The site returns a positive acknowledgement
if

(i) the front of the transaction T is currently at the site and

(ii) in the list of unreleased resources maintained by the transaction front,
the resource R is present with associated timestamp t.

The  site  returns  a  negative  acknowledgement  otherwise. 
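
A site's handling of this message can be sketched as below. This is a minimal sketch under assumed data structures: the function name and the fronts_here dictionary (mapping each transaction whose front is at this site to its list of unreleased resources) are hypothetical.

```python
def handle_confirm_ownership(fronts_here, transaction, resource, timestamp):
    """Return True for a positive acknowledgement, False for a negative one.

    fronts_here maps each transaction whose front is currently at this site
    to that front's unreleased-resource list: {resource: request_timestamp}.
    """
    unreleased = fronts_here.get(transaction)
    if unreleased is None:
        return False  # condition (i) fails: the front is not at this site
    # Condition (ii): the resource is still held and carries exactly this timestamp.
    return unreleased.get(resource) == timestamp
```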

Periodically,  every  resource  controller  sends  a  report  to  the  deadlock 
detector,  giving  for  each  resource  under  its  control 

(i)  the  set  of  transactions  in  possession,  along  with  the  timestamps  of  the 
corresponding  requests  and 

(ii)  the  set  of  transactions  waiting,  along  with  the  timestamps  of  the 
corresponding  requests. 

The deadlock detector executes the following algorithm:

(i)  Periodically,  it  selects  the  latest  report  from  each  resource  controller  and 
assembles  the  reports. 

(ii) If one or more cycles are detected in the assembled graph, the following
procedure is executed for each cycle C:

Let C be the sequence of arcs

T0 -> R0 -> T1 -> R1 -> ... -> T(N-1) -> R(N-1) -> T0.

Let t_w(i), i = 0,...,N-1, be the timestamp associated with the arc
Ti -> Ri. Let t_a(i), i = 1,...,N-1, be the timestamp associated with the
arc R(i-1) -> Ti and let t_a(0) be the timestamp associated with the arc
R(N-1) -> T0.

If t_a(i) ≤ t_w(i) for every transaction Ti in cycle C, then

(a) to every site ORIG(t_w(i)), i = 1,...,N-1, send a confirm_ownership
message (Ti, R(i-1), t_a(i)) [ORIG(t) represents the site at which
timestamp t is issued, and can be computed from t itself.] and to
ORIG(t_w(0)) send a confirm_ownership message (T0, R(N-1), t_a(0)).

(b) if all acknowledgements are positive, declare cycle C to
represent a deadlock.
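
The per-cycle procedure can be sketched as follows. This is a hedged illustration rather than the detector itself. It assumes that each transaction in the cycle is described by a record holding the timestamp of its wait arc (t_w) and of its possession arc (t_a). It also assumes that the timestamp carried in the message is the request timestamp of the held resource, so that it can be matched against the front's list of unreleased resources. The names Hop, is_deadlock, and the orig and confirm callbacks are hypothetical.

```python
from typing import NamedTuple


class Hop(NamedTuple):
    transaction: str  # T_i
    waits_for: str    # R_i, the resource T_i is waiting for
    t_w: int          # timestamp on the arc T_i -> R_i
    t_a: int          # timestamp on the arc R_(i-1) -> T_i (indices mod N)


def is_deadlock(cycle, orig, confirm):
    """Decide whether a cycle in the assembled graph is a true deadlock.

    orig(t) gives the site that issued timestamp t; confirm(site, T, R, t)
    sends one confirm_ownership message and returns the acknowledgement.
    """
    n = len(cycle)
    # The extra phase is started only if every transaction requested the
    # resource it holds no later than the resource it is waiting for.
    if not all(h.t_a <= h.t_w for h in cycle):
        return False
    for i, h in enumerate(cycle):
        held = cycle[(i - 1) % n].waits_for  # R_(i-1), held by T_i
        site = orig(h.t_w)                   # where the front of T_i made its last request
        if not confirm(site, h.transaction, held, h.t_a):
            return False
    return True
```

The messages are sent one at a time here for brevity; in practice they could be sent in parallel, so that the extra phase costs a single round trip.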

Thm  4.2:  Every  cycle  C  declared  to  represent  a  deadlock  represents  a  true 
deadlock. 

Proof: The argument is similar to that for Thm 4.1, namely that for each
node in the cycle, the incoming arc can vanish only after the outgoing arc.
This holds for resource nodes for the same reason as before. It holds for a
transaction node Ti because

(i) since t_a(i) ≤ t_w(i), the request for the acquired resource
occurred at the same time as or before the request for the resource awaited
by Ti, and

(ii) at a time later than t_w(i), the acquired resource has not been released,
since a positive acknowledgement from ORIG(t_w(i)) is received.

Therefore C represents a genuine deadlock.


Thm  4.3.  Every  genuine  deadlock  is  detected. 

Proof. Let the deadlock be represented by the cycle
C: T0 -> R0 -> T1 -> ... -> T(N-1) -> R(N-1) -> T0, with t_a(i), t_w(i),
i = 0,1,...,N-1, being
defined as before. Since the deadlock is genuine, none of its arcs will vanish
until the deadlock is broken. Hence the cycle will be detected by the
deadlock detector. Further, since the resource acquired by Ti, i = 0,...,N-1, in
the cycle C must have been requested at the same time as or before Ti requests
Ri, t_a(i) ≤ t_w(i). Therefore, the deadlock detector will send out
confirm_ownership messages to ORIG(t_w(i)), i = 0,...,N-1. Since the
deadlock is genuine, the front of transaction Ti will be trapped at the site where it
requested Ri, i.e. ORIG(t_w(i)). Hence positive acknowledgements will be
received for all the confirm_ownership messages sent out and a deadlock
will be declared.


Since a cycle is likely to occur only rarely in the assembled graph at the
deadlock detector, the confirm_ownership messages will occur only rarely.
Hence, the participation of non-resource-controlling sites will be only rarely
required for deadlock detection. The lower communication overhead that
this algorithm causes is obtained at the expense of a larger time to detect a
deadlock compared to the one-phase algorithm in [HO 79]. The
confirm_ownership messages and acknowledgements constitute an extra
phase which increases the detection time by one round-trip delay.

Fig. 4.3.a shows a case where the extra phase is not initiated since
t_a(1) > t_w(1). Fig. 4.3.b shows a case which does invoke the extra phase.


4.6.  Distributed  Detection 


4.6.1. Past Work


The  first  distributed  detection  algorithms  are  in  [CHA  74,  MAH  76].  In 
both,  issued  requests  for  resources  are  divided  into  those  that  are  incapable 
of  causing  a  global  deadlock  and  those  that  are  capable  of  doing  so.  In  the 
latter  case,  resource  tables  from  all  sites  in  the  network  are  assembled  to 
check  if  a  global  deadlock  exists.  Besides  causing  excessive  communication 
overhead,  both  algorithms  have  been  shown  in  [GOL  77]  to  fail  in  detecting 
certain kinds of deadlocks. In the "on-line" algorithm of [ISL 78], a complete
global  view  of  resource  status  is  maintained  at  each  site.  This  algorithm  also 
suffers  from  excessive  communication  overhead. 

[GOL 77] presents a distributed algorithm which is similar to many
subsequently-appearing algorithms [MEN 79, CHA 82, OBE 82, BAD 83]. The
common element in these algorithms is forward traversal of the global
status graph (i.e. traversal in the direction of the graph arcs), which may
cause the deadlock computation to migrate from site to site as intersite arcs
are encountered. In [BAD 83], a request from a transaction is accompanied
by its previous lock history; this hastens detection of intersite deadlock
cycles of length two. In [OBE 82], an intersite arc is traversed only if the tail
node id is greater than the head node id; this optimization reduces the
number of deadlock detection messages caused by a cycle of arcs by half. In
[MEN 79], the results of graph traversals are also recorded in the graph in
the form of arcs representing indirect dependencies (i.e. a chain of arcs
a1 -> a2 -> ... -> aN may cause the addition of an arc a1 -> aN). However, there is
no provision in the algorithm for updating this "condensed" information, and
hence false deadlocks may be detected. It was shown in [GLI 80] that the
algorithm also fails to detect some deadlocks. One of the authors of
[MEN 79] proposed a solution, presented in [GLI 80], purporting to remedy
this last deficiency. But in [TSA 82] it was shown that this solution, too, does
not detect all deadlocks.


[TSA 82] proposes a solution which differs from previous solutions in that
it traverses the graph in reverse, i.e. it proceeds from one transaction to
another waiting for the first to release some resource. With forward
traversal of a chain of arcs, the transactions in the chain yet to be traversed must
release one or more resources before the transactions in the chain already
traversed can leave their waiting state. But with reverse traversal, this is
not true. Therefore, if a node already encountered is re-encountered in the
backward traversal, the algorithm must verify that the forward path to that
node still exists before declaring a deadlock. In fact, [TSA 82] indicates
deadlocks in some cases where they do not exist.

In Fig. 4.4.a, a chain of arcs
R20 -> T21 -> R21 -> T22 -> R22 -> ... -> T2i -> R2i -> T2(i+1) is shown. Suppose the
request of T2i for R2i occurs at t0. In the algorithm of [TSA 82], the wait of
T2i for R2i will be propagated backwards in the form of a "reaching edge".
This, when it reaches the site where T21 resides, creates the "resource
reaching edge" T21 -> R2i.

In Fig. 4.4.b, transaction T2(i+1) has released R2i, which is now free.

In Fig. 4.4.c, transaction Tx acquires R2i. Let this be at time t1. Then
it requests resource R10, which is the first node in the chain
R10 -> T11 -> R11 -> T12 -> ... -> R1(j-1) -> T1j.

Next, in Fig. 4.4.d, T1j requests the resource R20 and is blocked. Let
this occur at t2 > t1. The indirect wait of T21 for R2i (which is, unknown to
the site where T21 resides, not valid any more) is propagated backwards till
the resource reaching edge Tx -> R2i is created, forming a cycle. Moreover,
the timestamp that accompanies the reaching edge is t2, which will indicate
that some transaction reachable by following arcs forward from Tx
requested R2i after Tx acquired it. Hence, the algorithm declares deadlock.
Thus, although the timestamp mechanism was introduced to prevent
spurious indications of deadlock, a false deadlock is detected here.

FIG. 4.4. COUNTEREXAMPLE TO ALGORITHM IN [TSA 82].

It does not appear difficult to correct this error. One solution appears
to be to associate a timestamp of t0 with every resource reaching edge whose
creation originated with a resource request occurring at t0. However, there
is another reason why the algorithms of [TSA 82, OBE 82, CHA 82, BAD 83] are
deficient in comparison with centralized algorithms. In a centralized
algorithm, a deadlock can be detected by the detector site one message delay
after the last arc of the deadlock cycle comes into existence. But with the
distributed algorithms just mentioned, a delay equal to the time to go around
the cycle is usually required. If all arcs of the cycle come into existence
more or less simultaneously, this delay cannot be avoided. But if the last arc
comes into existence an appreciable time after all or most of the other arcs
come into existence, it should be possible to reduce the detection time by
the use of "condensed" information. In the algorithm proposed below, both
forward and backward traversal are used to achieve this goal. Further, a
timestamp mechanism is used to prevent false deadlock indications. In this
algorithm, timestamps have no ordering role to play, but act as unique
identifiers.

4.6.2.  A  Distributed  Detection  Algorithm 


4.6.2.1. Terminology

In the algorithm proposed below, the information necessary to detect
deadlocks is represented in the form of a transaction_agent-resource-message
(TRM) graph at each site. In the graph, there are agent, resource
and message nodes. The former two types of nodes represent transaction
agents and resources respectively. T.x represents the agent of transaction
T at node x. For each pair of communicating agents T.a and T.b of a
transaction T at sites a and b respectively, there can exist two nodes M(T,a,b)
and M(T,b,a), the former representing messages sent by T.a to T.b and the
latter representing messages sent by T.b to T.a. Let TRM(s) be the TRM
graph at site s. Then the nodes in TRM(s) representing (a) transaction
agents residing at site s, (b) resources whose lock controllers are at site s
and (c) messages sent to a transaction agent local to site s are said to be
local nodes. Other nodes are non-local.

Fig. 4.5 shows an example involving four sites a, b, c, and d. Local to site
a are the agent nodes T1.a, T2.a, the resource node R1 and the message
nodes M(T1,b,a) and M(T1,c,a). Local to site b are the agent nodes
T1.b, T6.b, the resource node R2 and the message node M(T1,a,b). Local
to site c are the agent nodes T1.c, T3.c, the resource node R3 and the
message node M(T1,a,c). Local to site d are the agent nodes T4.d, T5.d and the
resource nodes R4, R5, R6.

The transaction T1 is a distributed transaction. T1.a does the commit
co-ordination of the transaction; it also updates R1. As can be seen from the
figure, R1 has been locked by T1.a. T1.a is not waiting for M(T1,b,a) since
T1.b has already sent the prepared-to-commit message to T1.a. T1.b is
supposed to update R2, on which it has obtained a lock. But T1.c, which is
supposed to update R3, has not yet got a lock on it and hence it has not yet
returned a prepared-to-commit message to T1.a, which is therefore waiting
for M(T1,c,a). T1.a has sent one message each to T1.b and T1.c
(conveying the work they are to do), hence the arcs from M(T1,a,b) and
M(T1,a,c) to T1.a are marked 2b and 2c respectively, signifying that the
second message from T1.a to T1.b and the second message from T1.a to
T1.c are in T1.a's possession, i.e. have not yet been sent by it. The second
part of the marking refers to the site to which the message is being sent.
Since T1.b has sent one message to T1.a, the arc from M(T1,b,a) to T1.b is
marked 2a. The transaction T2.a is waiting to lock R1.

At site c, T1.c has not yet sent its prepared-to-commit message to
T1.a, hence the marking 1a on the arc from M(T1,c,a) to T1.c. T1.c is
waiting to lock R3, which has been locked by T3.c. T3.c is waiting to lock
R4, whose lock controller is at site d. The number 737 on the corresponding
arc is a timestamp showing the time at which the lock request was made.

At site d, T4.d has locked R4 and is waiting for a lock on R5, which has
been locked by T5.d. T5.d in turn is waiting to lock R6, which has been
locked by T6.b. The marking 600 on the arc from R6 to T6.b is the time
at which the lock was granted.

Note  that  arcs  between  nodes  local  to  two  different  sites  are  reproduced 
at  both  sites,  and  that  they  have  a  marking  corresponding  to  a  message 
number  concatenated  with  a  site  id,  or  a  timestamp.  This  marking  is  done  so 
that  the  graphs  at  different  sites  can  be  put  together  consistently  for  graph 
traversals. Arcs between nodes local to the same site are not marked and
are called internal arcs.

At any site, arcs running from non-local nodes to local nodes are
referred to as direct incoming arcs (DIAs) and the corresponding local
nodes are referred to as in-nodes. Arcs running from local nodes to
non-local nodes are referred to as direct outgoing arcs (DOAs) and the
corresponding local nodes are referred to as out-nodes. At site a in Fig. 4.5,
T1.a is an in-node and M(T1,b,a) and M(T1,c,a) are out-nodes. The arcs
M(T1,a,b) -> T1.a and M(T1,a,c) -> T1.a are DIAs and the arcs
M(T1,c,a) -> T1.c and M(T1,b,a) -> T1.b are DOAs. Note that a DIA at one site
is a DOA at another, and vice versa.
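
The locality rules and the resulting arc classes can be sketched as below. This is an illustrative sketch only: the tuple encoding of nodes and the names home_site and classify_arc are hypothetical and do not appear in the text.

```python
def home_site(node):
    """Return the site to which a TRM-graph node is local."""
    kind = node[0]
    if kind == "agent":      # ("agent", transaction, site)
        return node[2]
    if kind == "resource":   # ("resource", name, controller_site)
        return node[2]
    if kind == "message":    # ("message", transaction, sender_site, receiver_site)
        return node[3]       # local to the site of the receiving agent
    raise ValueError(f"unknown node kind: {kind!r}")


def classify_arc(site, src, dst):
    """Classify the arc src -> dst as seen from `site`."""
    src_local = home_site(src) == site
    dst_local = home_site(dst) == site
    if src_local and dst_local:
        return "internal"
    if dst_local:
        return "DIA"   # from a non-local node to a local node
    if src_local:
        return "DOA"   # from a local node to a non-local node
    return None        # neither end is local, so the arc is not direct at this site
```

With the nodes of Fig. 4.5, the arc M(T1,a,b) -> T1.a is classified as a DIA at site a and as a DOA at site b, which matches the observation that a DIA at one site is a DOA at another.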

The outgoing arcs defined so far represent direct relationships between
out-nodes at one site and in-nodes at another. In order to speed up deadlock
detection, condensed information in the form of
indirect outgoing arcs (IOAs) is maintained. IOAs represent indirect
relationships between out-nodes at one site and in-nodes at a different or the
same site.

Fig. 4.6 shows Fig. 4.5 augmented with IOAs shown in dotted arcs. For
example, the out-node M(T1,a,b) at site b and the in-node T6.b at the same
site have a sequence of arcs connecting them:
M(T1,a,b) -(2b)-> T1.a, T1.a -> M(T1,c,a), M(T1,c,a) -(1a)-> T1.c, T1.c -> R3,
R3 -> T3.c, T3.c -(737)-> R4, R4 -> T4.d, T4.d -> R5, R5 -> T5.d, T5.d -> R6,
R6 -(600)-> T6.b. This connecting sequence of arcs is represented concisely by
the IOA M(T1,a,b) -(600)-> T6.b at site b. Each IOA is associated with a DOA,
namely the DOA which is first in the connecting sequence. Thus the IOA
M(T1,a,b) -(600)-> T6.b is associated with the DOA M(T1,a,b) -(2b)-> T1.a. If T6.b
tries to lock R2 in a mode incompatible with the mode in which T1.b has
locked it, the arc T6.b -> R2 will be created, causing a deadlock to be detected
at site b due to the cycle T1.b -> M(T1,a,b) -(600)-> T6.b -> R2 -> T1.b. Thus the
deadlock is detected at once instead of having to wait for several message
delays till information from all four nodes is gathered as in, e.g., [OBE 82].

If an arc a runs from node x to node y, we refer to x as the head and to
y as the tail of a, respectively. DIAs and DOAs have message numbers
concatenated with site ids or timestamps associated with them, which are called
the marks of these arcs. The arc-identifier of a DIA or DOA, d, is the pair of
values (head(d), mark(d)). The IOAs associated with a DOA are stored as
arc-identifiers in the ioas field of the DOA.

The algorithm utilizes timeout periods in such a manner that under
lightly loaded conditions, i.e., when requested locks and messages become
available to the requestor within the specified timeout periods, and acquired
locks are released within specified timeout periods, no checks for the
existence of deadlock cycles or attempts to construct IOAs in order to hasten
the detection of deadlocks occur. For this reason, with each arc there is an
associated field timed_out taking the values TRUE or FALSE according as
the timeout period for the arc has completed or not. [In some cases, as will
be seen, there will be no need to even start the timeout, so the timed_out
field takes the value TRUE as soon as the arc is created.]

Not only a wait by an agent for a resource or message, but also the
granting of a resource to a message may cause a deadlock [ISL 80]. If,
however, the grant causes the transaction agent to enter the active state, no
deadlock can occur as a result of such a resource grant and the algorithm
can be optimized accordingly.

In the next section, we present the algorithm for deadlock detection as
executed at a site s. In order to distinguish units of communication
exchanged between transaction agents from those exchanged between resource
controllers for the purpose of resource allocation and deallocation as well as
deadlock detection, we refer to the latter as signals, the former being
referred to, as before, as messages.

There are six kinds of signals:

(i) resource request (RR): This signal is sent when a lock on a non-local
resource is requested by a local transaction agent. The signal carries
information identifying the requesting agent, the name of the resource and the
mode of lock desired, and is accompanied by a timestamp TS_R generated at
the requesting site.

(ii) resource grant (RG): This signal is sent in response to a RR signal, to the
requesting site. In addition to information identifying the RR signal to which
it is a response, it carries a timestamp TS_G generated at the site that is
sending the RG signal.

(iii) resource free (RF): This signal is sent to the resource controller at
another site when a local transaction agent no longer requires a lock on a
remote resource under the control of that resource controller. It carries
information that enables the resource controller to delete the appropriate
DOA and associated IOAs.

(iv) agent create (AC): When a local agent T.s wishes to create an agent at
another site r, this signal is sent to the site r. A full duplex channel is
established between the two agents.

(v) backward propagation (BP): This signal is used to establish IOAs to speed
up deadlock detection. It has two fields: IA_SET and OA_SET. The former is
a set of DIAs, with the tail of each DIA being local to the same site r, the site
to which the BP signal is sent. OA_SET is a set of arc-identifiers, corresponding
to a subset of the DOAs at site s and associated IOAs. On receipt of a BP
signal, further BP signals may be sent. The BP signals flow backwards along
the TRM graphs.

(vi) forward propagation (FP): This signal is sent in order to detect possible
multisite deadlocks. Each FP signal may spark off further FP signals at the
recipient site, and these signals are said to belong to the same
deadlock computation. Basically, the FP signals of a deadlock computation
traverse the TRM graphs in the forward direction. A FP signal has five fields:

(a) ORIGIN: This is the id of the site that began the deadlock computation
to which the FP signal belongs.

(b) VICTIM: This is the transaction that is to be aborted if the deadlock
computation finds one or more deadlock cycles.

(c) CHECK_SET: This is a set of DIAs at the ORIGIN site. If the deadlock
computation reaches one or more DOAs or IOAs at some site
corresponding to one or more members of the CHECK_SET, then the
transaction VICTIM is deadlocked and must be aborted.

(d) TRAVERSED_SET: This is a set of arc-identifiers corresponding to
DIAs along which the deadlock computation has already traveled,
hence no new traversals along these DIAs should be initiated in this
deadlock computation.

(e) OA_SET: This is a set of arc-identifiers corresponding to DIAs at the
site to which the FP signal is being sent. These DIAs are the arcs along
which the graph traversal is continued at the recipient site.
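
The five fields of an FP signal can be pictured with the record below. This is a sketch under assumptions: all three sets are represented uniformly as sets of arc-identifiers, that is (head, mark) pairs, and the names FPSignal, victim_is_deadlocked and untraversed are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FPSignal:
    origin: str               # site that began the deadlock computation
    victim: str               # transaction to abort if a cycle is found
    check_set: frozenset      # arc-identifiers of DIAs at the origin site
    traversed_set: frozenset  # arc-identifiers of DIAs already traversed
    oa_set: frozenset         # arc-identifiers of DIAs at the recipient site


def victim_is_deadlocked(signal, reached):
    """True if any reached DOA or IOA corresponds to a member of CHECK_SET."""
    return not signal.check_set.isdisjoint(reached)


def untraversed(signal, candidate_dias):
    """DIAs along which this deadlock computation may still be continued."""
    return set(candidate_dias) - signal.traversed_set
```

Keeping TRAVERSED_SET inside the signal lets each recipient prune repeated traversals without any shared state between sites.
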

The resource controller algorithm is event-driven. The relevant events
are

(i) timeouts and arrivals of signals

(ii) locally originating requests for (a) obtaining locks on resources and
releasing them, (b) creating agents at other sites and (c) sending to, and
waiting to receive messages from, agents of the same transaction at other
sites.

It is assumed in the following description that when a transaction completes,
the nodes representing its agents and arcs incident at them are deleted. C_s
represents the current value of the local clock.

4.6.2.2. Description of Algorithm

Below  we  describe  the  actions  taken  by  the  resource  controller  on  the 
occurrence  of  each  event: 

Requesting, Granting and Freeing Resources

(1) Agent T.s requests resources R1, R2, R3, ..., RN (in specified modes).

(a) If not all requested resources are local, or if not all requested local
resources are available in the required modes, set the status of T.s to
waiting.

(b) For each resource available in the required mode, create the
appropriate internal arc (indicating resource possession), with
timed_out set to TRUE. (COMMENT: A timeout need not be started
for this arc, since if its creation results in a deadlock cycle, it will be
detected when the timeouts for the arcs created in (c), (d) below
complete.)

(c) For each local resource unavailable in the required mode, create
the appropriate internal arc (indicating resource wait), with
timed_out set to FALSE. Start a timeout of period T1.

(d) For each non-local resource, create the appropriate DOA (indicating
resource wait), with timed_out set to FALSE, mark set to C_s, and
ioas set to null. Start a timeout of period T2 and send a RR signal to
the site that controls the resource, with TS_R set to the same value as
the mark field.

(2) A RR signal arrives from T.s', s' ≠ s, for resource Ri with a timestamp TS_R.

(a) If the resource is available in the required mode,

(i) create the appropriate DOA (indicating resource possession),
with timed_out set to FALSE, mark set to C_s and ioas set to
null, and start a timeout of period T4.

(ii) send a RG signal to the requesting site, with TS_G set to the
same value as the mark field.

(b) If the resource is unavailable in the required mode, create the
appropriate DIA (indicating resource wait), with timed_out set to
FALSE, mark set to TS_R, and start a timeout of period T1.

(3) A RG signal in response to a request from a local agent T.s for a non-local
resource R arrives, with timestamp TS_G.

(a) Delete the DOA that represents T.s waiting for R and abort the
associated timeout if timed_out is FALSE.

(b) Create the appropriate DIA (indicating resource possession),
setting the mark field to TS_G. If no outgoing arcs remain at the node
T.s, set its status to active and set timed_out on the DIA to TRUE.
Otherwise set the field to FALSE and start a timeout of period T3.




(4) Agent T.s releases resources R1, R2, R3, ..., RM.

(a) For each Ri that is local,

(i) delete the appropriate internal arc (indicating resource
possession), aborting the associated timeout if timed_out is
FALSE.

(ii) choose, if possible, a set of transaction agents waiting to
access Ri that can now be given access to it.

(iii) for these agents, delete the appropriate internal arcs or
DIAs (indicating resource wait), aborting associated timeouts
when the timed_out fields are FALSE.

(iv) for each non-local agent, follow the steps given in 2(a)(i)
and 2(a)(ii).

(v) for each local agent granted access to Ri in (ii) above,
create the appropriate internal arc (indicating resource
possession). If no outgoing arcs remain from the agent node, set
its status to ready and set the timed_out field on the internal
arc to TRUE. Otherwise, set the field to FALSE and start a
timeout of period T3.

(b) For each non-local resource, delete the appropriate DIA, aborting
the associated timeout if timed_out is FALSE. Send a RF signal to
the site controlling the resource.

(5)  A  RF  signal,  indicating  that  T.s  has  released  a  resource  Ri  local  to  s, 
arrives. 

(a) Delete the DOA from Ri to T.s, aborting the associated timeout if
timed_out is FALSE.


(b)  Perform  steps  4(a)(ii)-(v). 
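
Step (1) above can be sketched as a function that decides, for each requested resource, which arc to create and whether a timeout or signal is needed. This is a simplified sketch, not the controller itself: the dictionary layout, the function name plan_resource_request, and the is_available callback are hypothetical, the lock mode "X" used in the usage note is an assumed example, and the timeout periods are referred to only by the labels T1 and T2.

```python
def plan_resource_request(site, requests, controller_of, is_available, clock):
    """Return (must_wait, arcs) for an agent at `site` issuing `requests`.

    requests      -- list of (resource, mode) pairs
    controller_of -- dict mapping each resource to its controlling site
    is_available  -- callable (resource, mode) -> bool, used for local resources
    clock         -- current local clock value, used as the mark of each DOA
    """
    arcs = []
    must_wait = False
    for resource, mode in requests:
        if controller_of[resource] != site:
            # Step (1)(d): DOA indicating resource wait, plus an RR signal
            # carrying the same timestamp as the mark field.
            arcs.append({"kind": "DOA", "resource": resource,
                         "timed_out": False, "mark": clock, "ioas": None,
                         "timeout": "T2",
                         "signal": ("RR", resource, mode, clock)})
            must_wait = True
        elif is_available(resource, mode):
            # Step (1)(b): possession arc; no timeout is started.
            arcs.append({"kind": "internal-possession", "resource": resource,
                         "timed_out": True, "timeout": None})
        else:
            # Step (1)(c): local wait arc with a timeout of period T1.
            arcs.append({"kind": "internal-wait", "resource": resource,
                         "timed_out": False, "timeout": "T1"})
            must_wait = True
    return must_wait, arcs
```

For instance, an agent at site a that asks for a free local resource, a busy local resource and a resource controlled at site d would obtain one possession arc, one wait arc and one DOA, and its status would be set to waiting.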


Agent Creation, Sending and Receiving Messages

(6) T.s requests creation of an agent at site r ≠ s.

(a) Create a DIA (indicating message possession) from M(T,s,r) to
T.s with mark set to 1r and timed_out set to TRUE.

(b) Create a DOA (indicating message possession) from M(T,r,s) to
T.r with mark set to 1s and timed_out set to TRUE. COMMENT: Note
that creation of DIAs and DOAs that indicate message possession does
not create a cycle and hence no timeout need be started. It is waiting
to receive a message that can complete a cycle.

(c) Send a AC signal to site r.

(7) A AC signal is received from site r ≠ s for creation of an agent
of transaction T.

Carry  out  steps  6(a)  and  6(b)  with  T  replacing  T. 

(8)  T.s  sends  a  message  to  T.r  1,T.t2.T.t3...T.tK. 

For  each  rj . 

(a)  increment  the  message  number  in  the  mark  field  on  the  DIA  from 
Af(T.s.rj)  to  T.s  by  1. 

(b)  send  the  message  to  site  rj. 

(9) 7".*  waits  for  a  message  from  T.rl,T.r2....T.rL. 

(a)  If  at  least  one  message  is  queued  for  T.s  from  each  of 
T.r  \,T.t2...T.tL,  then  remove  the  message  at  the  head  of  each  queue 


and  supply  the  messages  to  T.s. 

(b)  If  at  least  one  of  the  queues  is  empty 

(i) set  the  status  of  T.s  to  waitirg. 

(ii)  For  each  ri  such  that  there  are  no  messages  queued  from 
T.ri  to  T.s,  create  an  internal  arc  (  indicating  message  wait  ) 
from  T.s  to  Ii(T,ri,s)  with  timed_pui  set  FALSE  and  start  a 
timeout  of  period  Tb. 

(iii)  For  each  ri  such  that  the  queue  of  messages  from  T.ri  to 
T.s  is  non-empty,  remove  the  message  at  the  head  of  the  queue 
and  supply  it  to  T.s. 

(10) A message from T.r for T.s is received.

(a) Increment the message number in the mark field on the DOA from
M(T,r,s) to T.r by 1. Set the ioas field to null.

(b) If there is an arc from T.s to M(T,r,s), delete the arc, abort the
associated timeout if timed_out is FALSE, and supply the message to
T.s; otherwise queue the message. If there are no remaining outgoing
arcs from T.s, set its status to active.
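
The message-wait logic of steps 9 and 10 can be sketched for a single agent
T.s as follows. This is only an illustrative model, not the report's
implementation: the class name AgentMailbox and its fields are hypothetical,
the only outgoing arcs modelled are message-wait arcs, and the T5 timeouts and
the mark and ioas bookkeeping on the DOA are omitted.

```python
from collections import deque


class AgentMailbox:
    """Hypothetical model of the message queues and wait arcs of one agent T.s."""

    def __init__(self, senders):
        self.queues = {r: deque() for r in senders}  # one FIFO queue per sender T.r
        self.wait_arcs = set()   # senders r for which an arc T.s -> M(T,r,s) exists
        self.status = "active"
        self.delivered = []      # messages supplied to T.s, in order of delivery

    def wait_for_messages(self):
        """Step 9: T.s waits for one message from each sender."""
        for r, queue in self.queues.items():
            if queue:
                self.delivered.append(queue.popleft())  # 9(a) and 9(b)(iii)
            else:
                self.wait_arcs.add(r)                   # 9(b)(ii), timeout omitted
        if self.wait_arcs:
            self.status = "waiting"                     # 9(b)(i)

    def receive(self, r, msg):
        """Step 10(b): a message from T.r for T.s arrives."""
        if r in self.wait_arcs:
            self.wait_arcs.discard(r)      # delete the wait arc, supply the message
            self.delivered.append(msg)
        else:
            self.queues[r].append(msg)     # nobody is waiting: queue the message
        if not self.wait_arcs:
            self.status = "active"
```

In this sketch an agent that asks for messages from B and C while only B's
message is queued receives B's message at once, enters waiting status with a
single wait arc for C, and becomes active again when C's message arrives.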

Timeouts,  Forward  and  Backward  Propagation 

(11)  Timeout  on  an  internal  arc  a  completes. 

(a) Set the timed_out field on the arc to TRUE.

(b) Check for a cycle of internal arcs involving the arc a. If one or
more such cycles exist, abort the transaction involved in the arc and
stop. (Aborting the transaction involves releasing the resources held
by the agents of the transaction, deletion of the nodes representing
the agents, deletion of the arcs incident at these nodes and abortion
of associated timeouts, as necessary.)

(c) Traverse the graph forwards from the head of a, along internal
arcs with their timed_out fields TRUE, to find all the DOAs and
associated IOAs that can be reached. Let

ROA(a,*) = ∪[ROA(a,v1), ROA(a,v2), ..., ROA(a,vp)]

be the set of arc-identifiers corresponding to these DOAs and associated
IOAs, where ROA(a,vi) is the set of arc-identifiers corresponding to DOAs
and associated IOAs reachable by the above procedure from a whose head
nodes are local to site vi. Let ROA(a,•) be computed in a similar way
to ROA(a,*), with the added restriction that only DOAs with timed_out
fields set to TRUE and their associated IOAs are considered.

(d) Traverse the graph backwards from the tail of a, along internal
arcs with their timed_out fields TRUE, to find all the DIAs with their
timed_out fields also TRUE that can be thus reached. Let

RIA(a,*) = ∪[RIA(a,w1), RIA(a,w2), ..., RIA(a,wq)]

be the set of these DIAs, where RIA(a,wj) is the set of DIAs at site s
whose timed_out fields are TRUE, whose tail nodes are local to site wj,
and from the head of each of which a path of internal arcs with
timed_out fields set to TRUE leads to the tail of a.

(e) If one of the members of {vi}, say vp, is the site s itself, and
ROA(a,vp) contains a member that corresponds to a DIA in RIA(a,*),
then a deadlock exists; hence abort the transaction involved in a and
stop.

If RIA(a,*) is non-empty then

(i) Backward Propagation: if ROA(a,*) is non-empty, then for
each site wj send a BP signal with IA_SET set to RIA(a,wj) and
OA_SET set to ROA(a,*);

(ii) Forward Propagation: if ROA(a,*) is non-empty, then for
each site vi, send an FP signal with ORIGIN set to s, VICTIM set to
the transaction associated with the arc a, CHECK_SET set to
RIA(a,*), TRAVERSED_SET set to ROA(a,*) and OA_SET set to
ROA(a,vi).


COMMENT: Suppose a chain of arcs consisting of a DIA d_i, zero or more
internal arcs and a DOA d_o is formed at the site s. This will result in
the sending of a BP signal to the site to which the tail of d_i is local. In
order to prevent possible multiple identical BP signals from being sent
when the timeouts on the arcs on this chain complete, the algorithm
is designed so that the last of the arcs on the chain to have its
timed_out field set TRUE is the only one whose timeout completion
results in a BP signal being sent.

A similar situation exists in the case of the FP signals. However, the
completion of the timeout on a DOA does not initiate an FP signal (it
would be redundant, since there is a DIA corresponding to the DOA, at
another site, whose timeout completion would trigger FP signals if
necessary). Hence, here the algorithm calls for an FP signal to be
sent when the last arc on the chain, excluding d_o, has its timed_out
field set TRUE.

(12) Timeout completes on DIA, d_i.

(a) Set timed_out field of d_i to TRUE.

(b) Let RIA(d_i,*) = RIA(d_i,w) = {d_i}, where w is the site to which the
tail of d_i is local. Compute ROA(d_i,*) and ROA(d_i,•) analogously to step
11(c) above, starting the graph traversal from the head of d_i.

(c) Perform backward and forward propagation in the same manner as
in step 11(e).

(13) Timeout completes on DOA, d_o.

(a) Set timed_out on d_o to TRUE.

(b) Let ROA(d_o,*) = (arc-identifier of d_o) ∪ (ioas field of d_o).
Compute RIA(d_o,*) analogously to step 11(d), starting the backward
traversal from the tail of d_o.

(c) If RIA(d_o,*) is non-empty, perform backward propagation in the
manner described in step 11(e)(i).

(14) An FP signal f is received.

(a) From OA_SET(f), check which members tally with the arc-identifier
of a DIA whose timed_out field is set TRUE. Let X be the set of
such DIAs.

(b) For each d_i in X determine ROA(d_i,*) as in step 11(c). If the head
of any member of ROA(d_i,*) is local to the site ORIGIN(f), and the
member tallies with a member of CHECK_SET(f), a deadlock exists;
hence abort transaction VICTIM(f) and stop. Delete those
arc-identifiers from ROA(d_i,*) that tally with a member in
TRAVERSED_SET(f). Let the remaining set of arc-identifiers be
designated XROA(d_i,*). Let S = {OA(v1), OA(v2), ..., OA(vn)} be the union
of XROA(d_i,*) for all d_i in X, partitioned according to the sites vj to
which the heads of the outgoing arcs are local.

(c) For each vj, send an FP signal to site vj with ORIGIN set to
ORIGIN(f), CHECK_SET set to CHECK_SET(f), TRAVERSED_SET set to
TRAVERSED_SET(f) ∪ S and OA_SET set to OA(vj).


(15) A BP signal b is received.

For each member d of IA_SET(b), if a DOA d_o exists that tallies with d,

(i) add the members of OA_SET(b) to the ioas field of the DOA,
ignoring those whose head is the same as the tail of the DOA. (This
situation can arise when a cycle exists.)

(ii) if the timed_out field of d_o is TRUE, compute RIA(d_o,*)
analogously to step 11(d), performing the backwards graph traversal
from the tail of d_o. Then, if RIA(d_o,*) is non-empty, send BP signals
as in 11(e)(i), with the OA_SET set to the set of new members in the
ioas field of d_o.
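
The forward and backward traversals of steps 11(c) and 11(d) can be sketched
as follows. This is a minimal illustration under stated assumptions, not the
report's implementation: the function names, the (tail, head, timed_out)
triples for internal arcs, and the dictionary layout of DOAs and DIAs are all
hypothetical, arc-identifiers are written as (head node, mark) pairs, and each
IOA is assumed to be stored together with the site to which its head is local.

```python
from collections import defaultdict


def reachable(start, internal_arcs, reverse=False):
    """Nodes reachable from start along internal arcs whose timed_out is TRUE.

    internal_arcs: iterable of (tail, head, timed_out) triples.
    reverse=True follows the arcs backwards (from head to tail).
    """
    step = defaultdict(list)
    for tail, head, timed_out in internal_arcs:
        if timed_out:
            if reverse:
                step[head].append(tail)
            else:
                step[tail].append(head)
    seen, stack = {start}, [start]
    while stack:
        for nxt in step[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def compute_roa(a, internal_arcs, doas):
    """Step 11(c): arc-identifiers reachable forwards from the head of a, by site.

    doas: dicts with keys tail (local node), site (site of the head),
    arc_id, and ioas (list of (site, arc_id) pairs).
    """
    nodes = reachable(a[1], internal_arcs)
    roa = defaultdict(set)
    for d in doas:
        if d["tail"] in nodes:
            roa[d["site"]].add(d["arc_id"])
            for site, arc_id in d["ioas"]:
                roa[site].add(arc_id)
    return dict(roa)


def compute_ria(a, internal_arcs, dias):
    """Step 11(d): timed-out DIAs whose heads lead to the tail of a, by site.

    dias: dicts with keys head (local node), site (site of the tail),
    arc_id, and timed_out.
    """
    nodes = reachable(a[0], internal_arcs, reverse=True)
    ria = defaultdict(set)
    for d in dias:
        if d["timed_out"] and d["head"] in nodes:
            ria[d["site"]].add(d["arc_id"])
    return dict(ria)
```

The restricted set ROA(a,•) would follow from the same sketch by skipping DOAs
whose own timed_out field is FALSE. An arc a lies on a local cycle (step
11(b)) exactly when its tail is among the nodes returned by reachable() when
started from its head.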

The timeout periods T1, T2, T3, T4 and T5 are started, respectively,
when an agent is waiting to acquire a lock on a local resource, an agent
is waiting to acquire a lock on a remote resource, a resource is waiting
to be released by a local agent, a resource is waiting to be released by
a remote agent, and an agent is waiting to receive a message. They should
therefore have values appropriately in excess of the average times
required for these respective events to occur. It is clear that T2-T1 and
T4-T3 should be of the order of a message delay.

4.6.2.3. An Example of Deadlock Processing

In this section, we give an example to show how the algorithm works.
The same conventions are used in Fig. 4.7.a-g as in Figs. 4.5 and 4.6, except
that the value of the timed_out field on each arc is also shown (T and F stand
for TRUE and FALSE respectively). Further, the flow of BP and FP signals is
also shown. There are 3 sites A, B, C in this example.

Fig. 4.7.a: The transaction agents T1.A, T2.B and T3.C request locks on local
resources R1, R2, R3 respectively and acquire them. Since no outstanding
arcs from the agents remain, no timeouts are started and the timed_out
fields are set to TRUE as soon as the arcs are created.

Fig. 4.7.b: The following events have occurred since the situation depicted in
Fig. 4.7.a existed.

(i) T2.B and T3.C requested locks on remote resources R3 and R1
at times 60 and 61 respectively. This leads to these agents' status being
set to waiting and to the creation of DOAs at sites B and C respectively. The
timed_out field on each of these arcs is set to FALSE and timeouts are
started on both of them. RR signals are sent to C and A respectively.

(ii) The RR signals are received. Since the requested resources are
unavailable, DIAs are created at sites C and A with their timed_out fields
set to FALSE and timeouts are started on the DIAs.

(iii) T1.A requests creation of an agent T1.B at site B. This leads to the
creation of appropriate message nodes and incident arcs at site A. An AC
signal is sent to site B to create an agent T1.B.

(iv)  On  receiving  the  AC  signal,  appropriate  message  nodes  and  incident  arcs 
are  created  at  site  B. 

Fig. 4.7.c: (i) T1.B has requested a lock on R2. Since R2 is unavailable, the
status of T1.B is set to waiting, the timed_out field on the arc from T1.B to
R2 is set to FALSE and a timeout is started.

(ii) At site A, T1.A completes its local processing and now waits to receive a
message from T1.B. It enters waiting status, an arc from T1.A to
M(T1,B,A) is created, its timed_out field is set to FALSE and a timeout is
started.

Fig. 4.7.d: (i) The timeouts on the DIAs T3.C →(61) R1 at site A and
T2.B →(60) R3 at site C complete, and therefore the timed_out fields are set
to TRUE. Conditions are now satisfied for site C to send an FP signal to
site A with its CHECK_SET containing the DIA T2.B →(60) R3 and its OA_SET
containing the arc-identifier (R1,61).

(ii) On receipt of this FP signal, site A finds a DIA with its timed_out field
set to TRUE and its arc-identifier tallying with the arc-identifier in the
OA_SET. However, there is no DOA reachable from this DIA through a path of
internal arcs with their timed_out fields TRUE. Hence, this deadlock
computation stops here.

Fig. 4.7.e: (i) The timeouts on the DOAs T2.B →(60) R3 at site B and
T3.C →(61) R1 at site C complete and therefore the timed_out fields are set
to TRUE. A BP signal is sent by site C with IA_SET containing the DIA
T2.B →(60) R3 and OA_SET the arc-identifier (R1,61).

(ii) On receiving this BP signal, site B finds that the IA_SET member matches
a DOA and hence the arc-identifier (R1,61) is added to the ioas field of this
DOA. Backward propagation from site B is inhibited since the arc from T1.B
to R2 has not completed its timeout.

Fig. 4.7.f: (i) The timeout period on the arc from T1.B to R2 completes and
the timed_out field is set to TRUE. The following signals get sent:

- an FP signal f1 to site C with ORIGIN set to B, VICTIM set to T1,
TRAVERSED_SET set to {(R3,60),(R1,61)}, OA_SET set to {(R3,60)} and
CHECK_SET set to {M(T1,B,A) →(1A) T1.B}.

- an FP signal f2 to site A, the same as above, except that OA_SET is set
to {(R1,61)}.

- a BP signal b1 to site A with IA_SET set to {M(T1,B,A) →(1A) T1.B} and
OA_SET set to {(R3,60),(R1,61)}.

(ii) The FP signal f1 reaches site C. The OA_SET member matches the sole
DIA at C. However, although its sole DOA is reachable by a path of internal
arcs with their timed_out fields set TRUE from this DIA, its arc-identifier is
included in the TRAVERSED_SET field of f1; hence no FP signal is generated.

(iii) The FP signal f2 reaches site A. Although the OA_SET member matches a
DIA, no DOA is reachable from this DIA through a path of internal arcs with
their timed_out fields set TRUE, and again no FP signal is generated.

If the BP signal b1 reaches site A and is processed before the timeout period
on the arc T1.A → M(T1,B,A) completes, then the deadlock will be detected
locally when the latter event occurs. In our example, the BP signal does not
reach site A in time for this to occur.

Fig. 4.7.g: (i) The timeout period on the arc from T1.A to M(T1,B,A)
completes and the timed_out field is set to TRUE. This leads to two signals
being sent:

- an FP signal f3 to site B with VICTIM set to T1, ORIGIN set to A,
OA_SET and TRAVERSED_SET set to {(T1.B,1A)} and CHECK_SET set to
{T3.C →(61) R1}.

- a BP signal b2 to site C with OA_SET set to {(T1.B,1A)} and IA_SET
set to {T3.C →(61) R1}.

(ii) On receipt of f3, site B finds that the OA_SET member matches its DIA
M(T1,B,A) →(1A) T1.B. From this DIA the DOA T2.B →(60) R3 and its associated
IOA with arc-identifier (R1,61) can be reached through a path of internal arcs
on which the timed_out fields are set to TRUE. But the arc-identifier
(R1,61) tallies with the CHECK_SET member of the FP signal. Hence a
deadlock is declared and transaction T1 aborted.
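
The handling of an FP signal (step 14) can be sketched and checked against
the final step of the example. This is an illustrative sketch only: the
function name handle_fp and the dictionary representation of signals are
hypothetical, DIAs in CHECK_SET are represented by their arc-identifiers
written as (head node, mark) pairs, and ROA(d,*) for each local DIA is
assumed to have been computed already as in step 11(c).

```python
def handle_fp(signal, dias, roa_of):
    """Sketch of step 14 at one site.

    signal: dict with keys ORIGIN, VICTIM, CHECK_SET, TRAVERSED_SET, OA_SET
            (the three sets hold arc-identifiers).
    dias:   dict mapping the arc-identifier of each local DIA to its
            timed_out flag.
    roa_of: dict mapping a DIA's arc-identifier to ROA(d,*), given as
            {site: set of arc-identifiers}.
    Returns ("abort", victim) or ("forward", list of new FP signals).
    """
    # 14(a): members of OA_SET that tally with a timed-out local DIA.
    x = [d for d in signal["OA_SET"] if dias.get(d)]
    remaining = {}
    for d in x:
        for site, ids in roa_of.get(d, {}).items():
            # 14(b): an arc back to the origin that tallies with CHECK_SET
            # closes the cycle.
            if site == signal["ORIGIN"] and ids & signal["CHECK_SET"]:
                return ("abort", signal["VICTIM"])
            new = ids - signal["TRAVERSED_SET"]
            if new:
                remaining.setdefault(site, set()).update(new)
    s = set().union(*remaining.values())
    # 14(c): one FP signal per site that still has untraversed arcs.
    out = [dict(signal, TO=site, OA_SET=ids,
                TRAVERSED_SET=signal["TRAVERSED_SET"] | s)
           for site, ids in sorted(remaining.items())]
    return ("forward", out)
```

With the data of Fig. 4.7.g, where f3 arrives at site B with ORIGIN A,
OA_SET and TRAVERSED_SET {(T1.B,1A)} and CHECK_SET {(R1,61)}, the sketch
returns an abort of T1. With the data of Fig. 4.7.f(ii), where f1 arrives at
site C, it returns an empty list of signals to forward.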

4.6.2.4. Proofs of Correctness

In  this  section  we  show  that  the  algorithm  detects  all  genuine  deadlocks 
and  does  not  give  any  indications  of  deadlock  when  none  exists. 

Thm  4.4:  Every  genuine  multisite  deadlock  is  detected  by  the  above  algo¬ 
rithm. 

Proof: Suppose there is a global deadlock cycle e(1) → e(2) → e(3) → ... → e(m),
where e(.) is a transaction agent, a resource or a message node, with each
node being distinct.

Each dependency is represented as an internal arc or DIA at some site. The
cycle takes the form of a chain of arcs

C: {d_i(s1), i(s1,1), i(s1,2), ..., i(s1,n1)},
   {d_i(s2), i(s2,1), i(s2,2), ..., i(s2,n2)}, ...,
   {d_i(sm), i(sm,1), i(sm,2), ..., i(sm,nm)}.

The  arcs  enclosed  within  a  pair  of  curly  braces  represent  a  contiguous  por¬ 
tion  of  the  cycle  contained  at  one  site,  with  neighboring  portions  of  the  cycle 
being  at  a  different  site.  Each  such  contiguous  portion  consists  of  a  DIA  and 
zero  or  more  internal  arcs.  The  head  of  the  last  arc  is  the  tail  of  a  DOA  that 
coincides  with  the  DIA  for  the  next  portion  of  the  cycle. 

Every transaction agent in C is blocked. Inspection of the algorithm
shows that when an agent in C enters waiting state, and the arc in C that is
directed away from the agent is created, a timeout period is started for the
arc. Hence at least one timeout period is started in connection with the
creation of an arc in C. Since C represents a deadlock, a timeout started will
not be aborted and will complete.

Let a be the arc in C whose timeout is last to complete among all the
timeouts that are started in connection with the creation of arcs in C, and let
t be the time this last timeout completes. We claim that all arcs in C will
have their timed_out fields set TRUE by t. By definition of t, all arcs in C
that have timeouts started on their creation will have their timeouts
complete and their timed_out fields set TRUE by t. But each arc in C that does
not have a timeout started on its creation, occurring at t1, and therefore has
its timed_out field set TRUE immediately at t1, must, by inspection of the
algorithm, be an internal arc indicating resource possession or a DIA
indicating message possession. The transaction agent at the head of this arc
must, at a time equal to or greater than t1, enter waiting state as a result of
requesting the resource or message to which there is an arc in C from the
agent. A timeout will be started for this arc (by inspection of the algorithm)
which completes at t2 < t. Since t1 ≤ t2, t1 < t.

Let the arc a be in site s1 (without loss of generality). At t, an FP signal
f1 will be sent to s2 with the arc-identifier of d_i(s2) in its OA_SET and
d_i(s1) in its CHECK_SET (unless the arc-identifier of d_i(s1) is present as an
IOA reachable from a, in which case the deadlock is detected locally). If on
receiving this signal, s2 does not send an FP signal f2 to s3 with the
arc-identifier of d_i(s3) in its OA_SET and d_i(s1) in its CHECK_SET, it will
be either because

(i) the TRAVERSED_SET of f1 already contains this arc-identifier, which means
that s1, simultaneously with sending f1 to s2, sent an FP signal f1' to s3 with
the arc-identifier of d_i(s3) in its OA_SET, or because

(ii) the deadlock cycle is detected at s2, as a result of the presence, as an
IOA reachable from d_i(s2), of the arc-identifier of d_i(s1).

In any case, unless the deadlock is detected at s1 or s2, site s3 will receive
an FP signal with the arc-identifier of d_i(s3) in its OA_SET and d_i(s1) in
its CHECK_SET. Proceeding in this manner, we can show that unless the
deadlock is detected at one of the sites s1, s2, ..., s(m-1), an FP signal will
be received by sm with d_i(sm) in its OA_SET and d_i(s1) in its CHECK_SET. But
the existence of the cycle C means that there is a DOA at sm reachable through
a path of internal arcs with their timed_out fields set TRUE and corresponding
to d_i(s1). Hence the deadlock will be detected.


Thm 4.5: No false indications of multisite deadlock are given by this
algorithm.

Proof: Suppose a multisite deadlock is detected at the origin of a deadlock
computation, as a result of the presence of an IOA reachable from a DIA,
which corresponds to the IOA itself. The presence of such an IOA means that
the DIA cannot vanish until after it vanishes, i.e. it will not vanish (except
by transaction aborts). Hence the deadlock is genuine.

Suppose that a deadlock is detected as the result of the following
sequence of FP signals f0, f1, f2, ..., f(l-1), originating at sites
s0, s1, s2, ..., s(l-1) and sent to sites s1, s2, ..., s(l) respectively.
Receipt of fj (j = 0, 1, 2, ..., l-2) triggers sending of f(j+1). Let the
sending of fj (j = 0, 1, ..., l-1) occur at time tj.

Consider the set of DIAs in the CHECK_SET of f0. At t0, s0 has not sent the
signals that will accompany the deletion of any of these DIAs. Further, it can
do this only after all the arcs corresponding to the arc-identifiers in the
OA_SET of f0 have been deleted at s0.

At t1, when f0 is received at site s1, it is found that one or more entries in
OA_SET(f0) coincide with DIAs in s1. Unless and until these DIAs are deleted
at s1 and the corresponding signals sent to s0, it is not possible to delete all
the DOAs in OA_SET(f0). But all these DIAs cannot be deleted before all the
arcs corresponding to the arc-identifiers in OA_SET(f1) are deleted. Thus it is
not possible to delete any of the DIAs in CHECK_SET(f0) before all the arcs
corresponding to the arc-identifiers in OA_SET(f1) are deleted.

Proceeding in this manner, we conclude that none of the DIAs in
CHECK_SET(f0) can be deleted before all the arcs corresponding to the
arc-identifiers in OA_SET(f(l-1)) are deleted.

Since the deadlock is detected at site s(l) on receipt of f(l-1), it follows
that:

(i) at the time of declaration of deadlock, there are DIAs in s(l) that
coincide with members of OA_SET(f(l-1)).

(ii) from these coincident DIAs, it is possible to reach, at the time of
declaration of deadlock, DOAs or IOAs that are coincident with
members of CHECK_SET(f0).

Hence, using the same reasoning as above, it follows that these coincident
members of CHECK_SET(f0) cannot be deleted until after they are deleted,
which implies that they cannot be deleted at all (unless by transaction
aborts). Hence the deadlock indication is correct.


4.6.2.5. Performance


The proposed algorithm detects multisite deadlocks faster than other
algorithms that have been proposed in the literature. This is because both
forward and backward propagation are used in this algorithm.

In the worst case, when all the wait dependencies in an intersite cycle
form at approximately the same time, the time required to detect the
deadlock is approximately half the time it would require to go round the
cycle. All distributed schemes proposed so far require a detection time
equal to, if not greater than, the time to go round the cycle.

If the last wait dependency occurs after a substantially long time from
the rest of the wait dependencies in the cycle, the deadlock will be detected
locally without having to go round the cycle. For cases in between the two
extreme cases cited above, the detection time will be intermediate.

The penalty paid for the improvement in response time is higher message
traffic. For a deadlock involving n sites, our algorithm requires a maximum
of approximately n^2 FP signals. (Each site may send one signal to
each of the other n-1 sites, serially if no IOAs are present (the signals
flowing around the loop) or in a combination of sequential flow and parallel
flow if IOAs are present.) Algorithms proposed till now use only serial flow
and in such algorithms it is possible to reduce the amount of communication
by half by requiring a serial flow of messages originating from a given
transaction to stop when it encounters a transaction of higher id than the
originating transaction. This optimization is difficult to incorporate in our
algorithm, in which FP signals can "skip over" one or more sites, and thus over
the nodes in the cycle in those sites.

Secondly, there is the overhead of backwards propagation, which is also
approximately of the order of n^2 signals for a cycle encompassing n sites.
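
The bound quoted above follows from each of the n sites sending at most one
FP signal to each of the other n-1 sites. A small sketch of that arithmetic
(the function name is hypothetical, and the count is only the upper bound
stated in the text, not a measured figure):

```python
def max_fp_signals(n):
    """Upper bound on FP signals for a deadlock cycle spanning n sites.

    Each of the n sites may send one signal to each of the other n - 1
    sites, which gives n * (n - 1), i.e. approximately n^2 for large n.
    """
    return n * (n - 1)
```

For the three-site example of Fig. 4.7 this bound is 6 FP signals, and for a
cycle spanning 10 sites it is 90.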

In  practice,  chains  of  dependencies  spreading  over  a  large  number  of 
sites  are  unlikely  to  occur  except  under  conditions  of  severe  contention.  It 
appears  reasonable  to  pay  the  cost  of  higher  communication  traffic  at  such 
times  in  order  to  quickly  detect  any  deadlock  that  may  exist  and  which  if  not 
detected  for  a  longer  time,  would  exacerbate  the  contention. 

4.7.  Conclusion 

In  this  chapter,  we  have  discussed  centralized  and  distributed  detection 
of  deadlocks  in  a  distributed  system.  For  the  case  of  centralized  detection, 
we  showed  that  reports  only  from  resource-controlling  sites  are  sufficient,  if 
usage  of  resources  is  2-phase,  to  detect  deadlocks  correctly.  For  the  case 
where  detection  is  centralized  but  resource  usage  is  not  2-phase,  we  con¬ 
structed  an  algorithm  for  the  "migrating"  transaction  model,  in  which  non¬ 
resource-controlling  sites  are  only  rarely  required  to  participate  in  deadlock 
detection.  Since  resource-controlling  sites  will  be  generally  few  compared  to 
the  number  of  sites  accessing  the  resources,  the  communication  costs  for 
deadlock detection are sharply reduced by this algorithm. Lastly, we
constructed a distributed detection algorithm which uses both forward and
backward traversal of the transaction_agent-resource-message graph to
speed up deadlock detection.

The algorithms utilize clock facilities to address the problem of race
conditions. In the centralized algorithm, timestamps derived from the clock
facility are assigned to every request for a resource. When the deadlock
detector site assembles reports from the resource controllers in the system,
it uses these timestamps to ascertain if an observed cyclic wait represents a
genuine deadlock. In the distributed scheme, timestamps (or message
numbers concatenated with site id) are affixed to every intersite arc. This
allows use of "condensed" information to hasten the detection of deadlocks.
Without the timestamps, it would be difficult to ascertain if the condensed
information is up-to-date. It is difficult, as pointed out in [GLI 80], to update
this condensed information soon enough to prevent spurious indications of
deadlock. By affixing marks as mentioned above to the condensed
information, the urgency of updating it is removed.


CHAPTER  5 


CONCLUSION 

In  this  report,  we  addressed  the  problems  of  maintaining  the  availability 
and  consistency  of  global  information  in  computer  networks.  Below,  we  sum¬ 
marize  our  results,  describe  some  experiences  during  the  research  and  sug¬ 
gest  future  directions. 

In Chapter 2, we described a network status maintenance scheme based
on a global clock mechanism. Our scheme differs from that of [HAM 80] in
that it relies upon the nearest neighbors of a site to determine its status and
propagate it, whereas in the latter scheme, probe messages are sent by any
site to determine the status of another site. An important lesson we learned
was how to put together reports from the neighbors of a site to determine its
status. The problem here is that all the links to a site may appear dead at
different times to its neighbors, but the site itself may never have crashed or
noticed that it was partitioned from the rest of the network. But these
status reports may be put together at another site which may then conclude
that the site has crashed. Rule C3 would then be violated. The natural
approach to solving this problem seemed to be to require that the report
timestamps should lie within a small time window. However, difficulties were
encountered in ensuring that a partitioned site observed its links to the rest
of the network to be down in a similar time window and crashed itself in time
to comply with rule C3. The solution adopted in the end involved putting
together the latest reports from all neighbors of a site to determine if the
site should be marked DOWN. The problem mentioned above was solved by
having a site mark a recovered link to a neighbor as UP in its CRASH_SELF
graph only after all other sites have marked it UP in their CRASH_OTHERS
graphs.

A  promising  extension  of  our  approach  is  in  the  direction  of  dynamic 
networks.  Such  networks  will  typically  consist  of  non-overlapping  clusters  of 
sites,  each  cluster  functioning  more  or  less  independently  of  other  clusters. 
Sites  may  migrate  from  one  cluster  to  another.  In  order  to  maintain  the 
consistency  of  information  concerning  membership  of  sites  in  clusters,  the 
following  modification  of  rule  C3  seems  appropriate: 

C3': If site x in cluster C does not have site y included in its list of sites in C
at time t, then site y does not consider itself to be a member of C at time t.

Our approach to realizing rule C3 in static networks, described in
Chapter 2, suggests how C3' may be realized in a dynamic network. Site x
would  remove  y  from  its  list  of  sites  in  C  only  when  it  finds  that  sites  in  C  are 
unable  to  communicate  with  y.  Site  y  finding  that  it  has  lost  contact  with 
sites  in  C  would  consider  its  membership  in  C  to  have  lapsed.  It  would  cease 
to  carry  out  the  actions  it  was  carrying  out  as  a  member  of  C,  and  institute 
appropriate  recovery  actions,  e.g.,  reapplying  for  membership  in  C,  becom¬ 
ing  a  member  of  another  cluster,  etc. 

In  searching  for  a  control  problem  to  test  out  our  network  status 
maintenance  scheme,  we  found  that  many  problems  become  simple  to  solve 
using  the  scheme.  An  example  is  the  election  protocol  of  [GAR  82].  The 
problem  here  is  to  choose  a  unique  co-ordinator  for  a  group  of  sites,  when 
the  current  co-ordinator  crashes.  At  all  times  it  is  desired  to  have  the  site 
with  highest  id  which  is  UP  as  co-ordinator.  The  solution  given  is  to  make 
the election of a co-ordinator atomic by using a 2-phase protocol to
broadcast the id of the new co-ordinator, similar to the 2-phase protocol in
distributed database systems [GRA 78]. The proof that the protocol works
correctly is not simple. But using our network status maintenance scheme,
the solution is quite straightforward: a site simply considers the site with
highest id marked UP in its CRASH_OTHERS graph as the current
co-ordinator. Rule C3 then ensures that no two sites consider themselves
co-ordinator at the same time.
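
The election rule just described can be sketched in a few lines. The function
name and the dictionary used here are hypothetical: the dictionary is assumed
to be a flattened view of a site's CRASH_OTHERS graph that maps each site id
to "UP" or "DOWN". The guarantee that no two sites consider themselves
co-ordinator at the same time comes from rule C3, not from this code.

```python
def current_coordinator(crash_others):
    """Return the highest site id marked UP, or None if no site is UP.

    crash_others: dict mapping site id -> "UP" or "DOWN" (an assumed,
    simplified stand-in for the CRASH_OTHERS graph).
    """
    up_sites = [site for site, status in crash_others.items() if status == "UP"]
    return max(up_sites, default=None)
```

For instance, a site whose view marks sites 1 and 2 UP and site 3 DOWN would
treat site 2 as the current co-ordinator.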

We  developed  a  solution  to  the  replicated  file  update  problem  in  Chapter 
2.  Without  the  use  of  the  clock  facility,  it  would  be  difficult  to  ensure  that 
two  (  or  more  )  different  WARM  sites  do  not  join  the  set  of  HOT  sites  when  a 
HOT  site  crashes.  If  two  different  WARM  sites  do  join,  it  would  be  necessary 
either  to  force  one  and  exactly  one  of  them  to  quit  the  set  subsequently,  or 
else  to  incur  the  overhead  of  updating  an  extra  site  before  a  'done'  signal  is 
returned  to  the  originator  of  an  update  transaction. 

In Chapter 3, we addressed the problem of preventing error propagation
in global information due to malfunctioning sites. We found that a more
general form of the Byzantine Generals Agreement was required and formulated
it. The notion of different kinds of malfunction-tolerance specification
was introduced as a way to trade off tolerance to malfunctions against the
costs involved. There are still many areas where knowledge of ways to
provide robustness against malfunctions is inadequate. These include
synchronization, security, efficient transfer of bulk data, update interactions
involving more than one updated variable, etc. A prototype for testing out BGA
algorithms is currently under construction at the San Jose IBM Research
Laboratory [STR 82] and experience in this project should contribute in this
direction.



In  Chapter  4,  we  addressed  deadlock  detection  in  distributed  databases. 
Algorithms  for  centralized  and  distributed  detection  were  proposed.  For 
centralized  detection  it  was  shown  that  2-phase  usage  of  resources  simplifies 
the  problem  of  dealing  with  race  conditions.  An  efficient  centralized  algo¬ 
rithm  was  proposed  for  the  ’migrating’  transaction  model.  A  distributed 
algorithm  using  both  backward  and  forward  propagation  to  hasten  detection 
was  described.  In  both  algorithms,  a  clock  facility  was  the  means  whereby 
the  consistency  of  'snapshots'  of  system  status  was  ensured. 